Learning String Edit Distance

Eric Sven Ristad Peter N. Yianilos 

Research Report CS-TR-532-96 
October 1996: Revised October 1997 



Abstract 

In many applications, it is necessary to determine the similarity of two strings.
A widely-used notion of string similarity is the edit distance: the minimum 
number of insertions, deletions, and substitutions required to transform one 
string into the other. In this report, we provide a stochastic model for string 
edit distance. Our stochastic model allows us to learn a string edit distance 
function from a corpus of examples. We illustrate the utility of our approach 
by applying it to the difficult problem of learning the pronunciation of words in 
conversational speech. In this application, we learn a string edit distance with 
one fourth the error rate of the untrained Levenshtein distance. Our approach 
is applicable to any string classification problem that may be solved using a 
similarity function against a database of labeled prototypes. 

1 Introduction



In many applications, it is necessary to determine the similarity of two strings. 
A widely-used notion of string similarity is the edit distance: the minimum 
number of insertions, deletions, and substitutions required to transform one 
string into the other JlJ]. In this report, we provide a stochastic model for 
string edit distance. Our stochastic interpretation allows us to automatically 
learn a string edit distance from a corpus of examples. It also leads to a variant 
of string edit distance, that aggregates the many different ways to transform one 
string into another. We illustrate the utility of our approach by applying it to 
the difficult problem of learning the pronunciation of words in the Switchboard 
corpus of conversational speech ||]. In this application, we learn a string edit 
distance that reduces the error rate of the untrained Levenshtein distance by a 
factor of four. 

Let us first define our notation. Let $A$ be a finite alphabet of distinct symbols
and let $x^T \in A^T$ denote an arbitrary string of length $T$ over the alphabet $A$.
Then $x_i^j$ denotes the substring of $x^T$ that begins at position $i$ and ends at
position $j$. For convenience, we abbreviate the unit length substring $x_i^i$ as $x_i$
and the length $t$ prefix of $x^T$ as $x^t$.

A string edit distance is characterized by a triple $(A, B, c)$ consisting of the
finite alphabets $A$ and $B$ and the primitive cost function $c : E \to \mathbb{R}_+$ where
$\mathbb{R}_+$ is the set of nonnegative reals, $E = E_s \cup E_d \cup E_i$ is the alphabet of primitive
edit operations, $E_s = A \times B$ is the set of the substitutions, $E_d = A \times \{\epsilon\}$ is the
set of the deletions, and $E_i = \{\epsilon\} \times B$ is the set of the insertions. Each such
triple $(A, B, c)$ induces a distance function $d_c : A^* \times B^* \to \mathbb{R}_+$ that maps a pair
of strings to a nonnegative value. The distance $d_c(x^t, y^v)$ between two strings
$x^t \in A^t$ and $y^v \in B^v$ is defined recursively as

$$d_c(x^t, y^v) = \min \left\{ \begin{array}{l} c(x_t, y_v) + d_c(x^{t-1}, y^{v-1}), \\ c(x_t, \epsilon) + d_c(x^{t-1}, y^v), \\ c(\epsilon, y_v) + d_c(x^t, y^{v-1}) \end{array} \right\} \qquad (1)$$

where $d_c(\epsilon, \epsilon) = 0$. The edit distance may be computed in $O(t \cdot v)$ time using
dynamic programming fllq, [28]. Many excellent reviews of the string edit distance
literature are available [10], [13], [1], [26]. Several variants of the edit distance have
been proposed, including the constrained edit distance [17] and the normalized
edit distance ]lq| .
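
The recursion in equation (1) can be sketched as a short dynamic program. This is a minimal illustration, not the report's own code: the function names `edit_distance` and `levenshtein_cost` are hypothetical, the empty string stands in for $\epsilon$, and the unit costs are the usual Levenshtein assumption.

```python
def edit_distance(x, y, cost):
    """Minimum total cost of editing x into y under the primitive cost function."""
    T, V = len(x), len(y)
    # d[t][v] is the distance between the prefixes x[:t] and y[:v].
    d = [[0.0] * (V + 1) for _ in range(T + 1)]
    for t in range(T + 1):
        for v in range(V + 1):
            if t == 0 and v == 0:
                continue  # d(empty, empty) = 0
            candidates = []
            if t > 0 and v > 0:  # substitution (x_t, y_v)
                candidates.append(cost(x[t - 1], y[v - 1]) + d[t - 1][v - 1])
            if t > 0:  # deletion (x_t, epsilon)
                candidates.append(cost(x[t - 1], "") + d[t - 1][v])
            if v > 0:  # insertion (epsilon, y_v)
                candidates.append(cost("", y[v - 1]) + d[t][v - 1])
            d[t][v] = min(candidates)
    return d[T][V]


def levenshtein_cost(a, b):
    """Assumed unit costs: identity substitutions are free, all other edits cost 1."""
    return 0.0 if a == b else 1.0


print(edit_distance("kitten", "sitting", levenshtein_cost))
```

Passing the cost function as an argument keeps the table-filling code independent of how the primitive costs are obtained, which is the quantity the rest of the report sets out to learn.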

A stochastic interpretation of string edit distance was first provided by Bahl 
and Jelinek [0, but without an algorithm for learning the edit costs. The need 
for such a learning algorithm is widely acknowledged [l8| |2(| . The principal 
contribution of this report is an efficient algorithm for learning the primitive edit 
costs from a corpus of examples. To the best of our knowledge, this is the first 
published algorithm to automatically learn the primitive edit costs. We initially 
implemented a two-dimensional variant of our approach in August 1993 for the 
problem of classifying greyscale images of handwritten digits. 

The remainder of this report consists of four sections and two appendices.
In section 2 we define our stochastic model of string edit distance and provide
an efficient algorithm to learn the primitive edit costs from a corpus of string
pairs. In section 3 we provide a stochastic model for string classification
problems, and provide an algorithm to estimate the parameters of this model from
a corpus of labeled strings. Our techniques are applicable to any string
classification problem that may be solved using a string distance function against a
database of labeled prototypes. In section 4, we apply our modeling techniques
to the difficult problem of learning the pronunciations of words in conversational
speech.

In appendix A we present results for the pronunciation recognition problem
in the classic nearest-neighbor paradigm. In appendix B we present an alternate
model of string edit distance, which is conditioned on string lengths.

2 String Distance 

We model string edit distance as a memoryless stochastic transduction between
the underlying strings $A^*$ and the surface strings $B^*$. Each step of the
transduction generates either a substitution pair $\langle a, b \rangle$, a deletion pair $\langle a, \epsilon \rangle$, an insertion
pair $\langle \epsilon, b \rangle$, or the distinguished termination symbol $\#$ according to a probability
function $\delta : E \cup \{\#\} \to [0, 1]$. Being a probability function, $\delta(\cdot)$ satisfies the
following constraints:

a. $\forall z \in E \cup \{\#\} \; [\, 0 \le \delta(z) \le 1 \,]$
b. $\sum_{z \in E \cup \{\#\}} \delta(z) = 1$

Note that the null operation $\langle \epsilon, \epsilon \rangle$ is not included in the alphabet $E$ of edit
operations.

A memoryless stochastic transducer $\phi = (A, B, \delta)$ naturally induces a
probability function $p(\cdot|\phi)$ on the space of all terminated edit sequences. This
probability function is defined by the following generation algorithm.

GENERATE($\phi$)

1. For $n = 1$ to $\infty$
2.     pick $z_n$ from $E \cup \{\#\}$ according to $\delta(\cdot)$
3.     if $z_n = \#$ [ return($z^n$); ]
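
The generation procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the dictionary representation of $\delta$, the function names `generate` and `yield_of`, and the probability values are all assumptions, with the empty string standing in for $\epsilon$ and the key `"#"` for the termination symbol.

```python
import random


def generate(delta, rng):
    """Sample edit operations from delta until '#' is drawn; return the edits."""
    ops = list(delta)
    weights = [delta[z] for z in ops]
    edits = []
    while True:
        z = rng.choices(ops, weights=weights)[0]
        if z == "#":
            return edits
        edits.append(z)


def yield_of(edits):
    """Concatenate the left and right symbols of an edit sequence."""
    x = "".join(a for a, _ in edits)
    y = "".join(b for _, b in edits)
    return x, y


# Hypothetical edit probabilities over A = {a}, B = {a, b}; they sum to one.
delta = {("a", "a"): 0.4, ("a", "b"): 0.1, ("a", ""): 0.1, ("", "b"): 0.1, "#": 0.3}
edits = generate(delta, random.Random(0))
print(edits, yield_of(edits))
```

Because every draw is independent of the previous ones, the number of edits before termination is geometrically distributed, which is the property discussed after Theorem 1.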

In our intended applications, we require a probability function on string 
pairs rather than on edit sequences. In order to obtain such a probability 
function, we consider a string pair to be the equivalence class representative for 
all edit sequences whose yield is that pair. Thus, the probability of a string 
pair is the sum of the probabilities of all edit sequences for that string pair. Let 
$\nu(z^n\#) \in A^* \times B^*$ be the yield of the terminated edit sequence $z^n\#$. Then we
define $p(x^T, y^V|\phi)$ to be the probability of the complex event $\nu^{-1}(\langle x^T, y^V \rangle)$,

$$p(x^T, y^V|\phi) = \sum_{\{z^n\# \,:\, \nu(z^n\#) = \langle x^T, y^V \rangle\}} p(z^n\#|\phi) \qquad (2)$$

where the probability $p(z^n\#|\phi)$ of a terminated edit sequence $z^n \in E^n$ is simply
the product of the probabilities $\delta(z_i)$ of the individual edit operations because
the transducer is memoryless.

Theorem 1 $p(\cdot,\cdot|\phi)$ is a valid probability function on $A^* \times B^*$ if and only if
$\delta(\cdot)$ is valid and $\delta(\#) > 0$.

Proof. If $\delta(\cdot)$ is a valid probability function and $\delta(\#) > 0$, then $p(\cdot|\phi)$ is a
valid probability function on the set of all finite terminated edit sequences
because $E^*\#$ is a complete prefix-free set. Each terminated edit sequence $z^n\#$
yields exactly one string pair $\nu(z^n\#)$. Therefore, the set $A^* \times B^*$ partitions the
set $E^*\#$ and $p(A^* \times B^*|\phi) = 1$.

If $\delta(\#) = 0$, then $p(z^n\#|\phi) = p(z^n|\phi)\,\delta(\#) = 0$ for all finite terminated edit
sequences and $p(A^* \times B^*|\phi) = 0$ because all string pairs in $A^* \times B^*$ are finite.
If $\delta(\cdot)$ is not valid, then $p(z^n\#|\phi)$ is invalid and $p(A^* \times B^*|\phi)$ must be invalid
as well. □

The use of a distinguished termination symbol # in a memoryless process 
entails that the probability of an edit sequence decays exponentially with its 
length. More importantly, the probability $p(n|\phi)$ that an edit sequence will
contain $n$ operations must also decrease uniformly at an exponential rate.

$$p(n|\phi) = \sum_{z^n \in E^n} p(z^n\#|\phi) = (1 - \delta(\#))^n \, \delta(\#)$$
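
The closed form for the length distribution follows in one step from memorylessness. As a brief sketch, writing $\delta$ for the edit probabilities and using only the constraint that they sum to one over $E \cup \{\#\}$:

```latex
\sum_{z^n \in E^n} p(z^n\#|\phi)
  = \delta(\#) \sum_{z^n \in E^n} \prod_{i=1}^{n} \delta(z_i)
  = \delta(\#) \Bigl( \sum_{z \in E} \delta(z) \Bigr)^{n}
  = (1 - \delta(\#))^{n} \, \delta(\#)
```

The middle equality holds because the sum over all length-$n$ sequences of a product of independent factors equals the $n$-th power of the single-step sum.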

In many natural processes, such as those involving communication, the probability
of an edit sequence does not decrease uniformly. More probability is assigned
to the medium-length messages than to the very short messages. As formulated,
the memoryless transducer is unable to accurately model such processes. In
appendix B, we present an alternate parameterization of the transducer without
a termination symbol. In the alternate parameterization, we directly model the
probability $p(T, V)$ that the underlying string contains $T$ symbols and the
surface string contains $V$ symbols. As a result, the probability of the length $n$ of
the underlying edit sequence need not decrease exponentially.

The remainder of this section explains how to use the memoryless stochastic 
transducer as a string edit distance. First we use the stochastic transducer to 
define two string edit distances: the Viterbi edit distance and the stochastic 
edit distance. We show how to efficiently evaluate the joint probability of a
string pair according to a given transducer $\phi$. This computation is necessary to
calculate the stochastic edit distance between two strings. Next, we explain how
to optimize the parameters of a memoryless transducer on a corpus of similar
string pairs. This computation is equivalent to learning the primitive edit costs.
Finally, we present three variants on the memoryless transducer, which lead to
three variants of the two string edit distances. Subsequently, section 3 explains
how to solve string classification problems using a stochastic transducer.

2.1 Two Distances 

Our interpretation of string edit distance as a stochastic transduction naturally
leads to the following two string distances. The first distance $d_\phi(\cdot,\cdot)$ is defined by
the most likely transduction between the two strings, while the second distance
$d'_\phi(\cdot,\cdot)$ is defined by aggregating all transductions between the two strings.

The first transduction distance $d_\phi(x^T, y^V)$, which we call the Viterbi edit
distance, is the negative logarithm of the probability of the most likely edit
sequence for the string pair $\langle x^T, y^V \rangle$.

$$d_\phi(x^T, y^V) = -\log \max_{\{z^n \,:\, \nu(z^n) = \langle x^T, y^V \rangle\}} p(z^n|\phi) \qquad (3)$$

This distance function is identical to the string edit distance $d_c(\cdot,\cdot)$ where the
edit costs are set to the negative logarithm of the edit probabilities, that is,
where $c(z) = -\log \delta(z)$ for all $z \in E$.

The second transduction distance $d'_\phi(x^T, y^V)$, which we call the stochastic
edit distance, is the negative logarithm of the probability of the string pair
$\langle x^T, y^V \rangle$ according to the transducer $\phi$.

$$d'_\phi(x^T, y^V) = -\log p(x^T, y^V|\phi) \qquad (4)$$

This second distance differs from the first in that it considers the contribution
of all ways to simultaneously generate the two strings. If the most likely edit
sequence for $\langle x^T, y^V \rangle$ is significantly more likely than any of the other edit
sequences, then the two transduction distances will be nearly equal. However, if
a given string pair has many likely generation paths, then the stochastic distance
$d'_\phi(\cdot,\cdot)$ can be significantly less than the Viterbi distance $d_\phi(\cdot,\cdot)$.

Unlike the classic edit distance $d_c(\cdot,\cdot)$, our two transduction distances are
never zero unless they are infinite for all other string pairs. Recall that the
Levenshtein distance assigns zero cost to all identity edit operations. Therefore,
an infinite number of identity edits is less costly than even a single insert, delete,
or substitute. The only way to obtain this property in a transduction distance is
to assign zero probability (i.e., infinite cost) to all nonidentity operations, which
would assign finite distance only to pairs of identical strings. Note that such
a transducer would still assign linearly increasing distance to pairs of identical
strings, unlike the Levenshtein distance.
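
The difference between the two distances can be seen on a small example. The sketch below is illustrative only: the function name `distances`, the dictionary representation of the edit probabilities, and the probability values are assumptions. To make the two numbers directly comparable, it also assumes that the termination probability is charged to both distances.

```python
import math


def distances(x, y, delta):
    """Return (Viterbi, stochastic) distances in nats for the pair (x, y)."""
    T, V = len(x), len(y)
    best = [[0.0] * (V + 1) for _ in range(T + 1)]   # max over edit sequences
    total = [[0.0] * (V + 1) for _ in range(T + 1)]  # sum over edit sequences
    best[0][0] = total[0][0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            if t == 0 and v == 0:
                continue
            steps = []  # (edit probability, predecessor row, predecessor column)
            if v > 0:
                steps.append((delta.get(("", y[v - 1]), 0.0), t, v - 1))
            if t > 0:
                steps.append((delta.get((x[t - 1], ""), 0.0), t - 1, v))
            if t > 0 and v > 0:
                steps.append((delta.get((x[t - 1], y[v - 1]), 0.0), t - 1, v - 1))
            best[t][v] = max(p * best[i][j] for p, i, j in steps)
            total[t][v] = sum(p * total[i][j] for p, i, j in steps)

    def neglog(p):
        return -math.log(p) if p > 0.0 else math.inf

    end = delta["#"]  # assumption: termination is charged to both distances
    return neglog(best[T][V] * end), neglog(total[T][V] * end)


# Hypothetical transducer over A = B = {a, b}; the probabilities sum to one.
delta = {("a", "a"): 0.2, ("b", "b"): 0.2, ("a", ""): 0.1, ("b", ""): 0.1,
         ("", "a"): 0.1, ("", "b"): 0.1, "#": 0.2}
viterbi, stochastic = distances("ab", "ab", delta)
print(viterbi, stochastic)
```

On this pair the single best path (two identity substitutions) carries most but not all of the probability mass, so the stochastic distance is slightly smaller than the Viterbi distance.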

2.2 Evaluation 

Our generative model assigns probability to terminated edit sequences and the
string pairs that they yield. Each pair of strings may be generated by many
different edit sequences. Therefore we must calculate the probability of a pair
of strings by summing the probability $p(z^n\#|\phi)$ over all the terminated edit
sequences that yield the given string pair (2).

Each string pair is generated by exponentially many edit sequences, and so
it would not be feasible to evaluate the probability of a string pair by actually
summing over all its edit sequences. The following dynamic programming
algorithm, due to Bahl and Jelinek ||, calculates the probability $p(x^T, y^V|\phi)$ in
$O(T \cdot V)$ time and space. At the end of the computation, the $\alpha_{t,v}$ entry contains
the probability $p(x^t, y^v|\phi)$ of the prefix pair $\langle x^t, y^v \rangle$ and $\alpha_{T,V}$ is the probability
of the entire string pair.

FORWARD-EVALUATE($x^T, y^V, \phi$)

1. $\alpha_{0,0} := 1$;
2. for $t = 0 \ldots T$
3.     for $v = 0 \ldots V$
4.         if ($v > 0 \lor t > 0$) [ $\alpha_{t,v} := 0$; ]
5.         if ($v > 0$) [ $\alpha_{t,v}$ += $\delta(\epsilon, y_v)\,\alpha_{t,v-1}$; ]
6.         if ($t > 0$) [ $\alpha_{t,v}$ += $\delta(x_t, \epsilon)\,\alpha_{t-1,v}$; ]
7.         if ($v > 0 \land t > 0$) [ $\alpha_{t,v}$ += $\delta(x_t, y_v)\,\alpha_{t-1,v-1}$; ]
8. $\alpha_{T,V}$ *= $\delta(\#)$;
9. return($\alpha$);

The space requirements of this algorithm may be reduced to $O(\min(T, V))$
at some expense in clarity.
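
The forward recurrence translates almost line for line into Python. The following is a sketch under stated assumptions rather than the authors' implementation: the transducer is represented as a dictionary from edit operations to probabilities, the empty string stands in for $\epsilon$, the key `"#"` holds the termination probability, and the values are hypothetical.

```python
def forward_evaluate(x, y, delta):
    """Return the table alpha; alpha[T][V] is the probability of the pair (x, y)."""
    T, V = len(x), len(y)
    alpha = [[0.0] * (V + 1) for _ in range(T + 1)]
    alpha[0][0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            if v > 0:  # insertion of y_v
                alpha[t][v] += delta.get(("", y[v - 1]), 0.0) * alpha[t][v - 1]
            if t > 0:  # deletion of x_t
                alpha[t][v] += delta.get((x[t - 1], ""), 0.0) * alpha[t - 1][v]
            if v > 0 and t > 0:  # substitution of x_t by y_v
                alpha[t][v] += delta.get((x[t - 1], y[v - 1]), 0.0) * alpha[t - 1][v - 1]
    alpha[T][V] *= delta["#"]  # terminate the edit sequence
    return alpha


# Hypothetical edit probabilities; they sum to one.
delta = {("a", "b"): 0.3, ("a", ""): 0.2, ("", "b"): 0.2, "#": 0.3}
alpha = forward_evaluate("a", "b", delta)
print(alpha[1][1])
```

For the pair (a, b) there are exactly three edit sequences: one substitution, deletion then insertion, and insertion then deletion. The table entry therefore equals (0.3 + 2 · 0.2 · 0.2) · 0.3, which the recurrence reproduces without enumerating the sequences.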

2.3 Estimation 

Under our stochastic model of string edit distance, the problem of learning the 
edit costs reduces to the problem of estimating the parameters of a memoryless 
stochastic transducer. For this task, we employ the powerful expectation
maximization (EM) framework || [IJ [j| . An EM algorithm is an iterative algorithm
that maximizes the probability of the training data according to the model.
See M] for a review. The applicability of EM to the problem of optimizing
the parameters of a memoryless stochastic transducer was first noted by Bahl,
Jelinek, and Mercer, although they did not publish an explicit algorithm
for this purpose.

As its name suggests, an EM algorithm consists of two steps. In the
expectation step, we accumulate the expectation of each hidden event on the training
corpus. In our case the hidden events are the edit operations used to generate
the string pairs. In the maximization step, we set our parameter values to their
relative expectations on the training corpus.

The following EXPECTATION-MAXIMIZATION() algorithm optimizes the
parameters $\phi$ of a memoryless stochastic transducer on a corpus
$C = \langle x^{T_1}, y^{V_1} \rangle, \ldots, \langle x^{T_n}, y^{V_n} \rangle$ of $n$ training pairs. Each iteration of the EM algorithm is
guaranteed to either increase the probability of the training corpus or not change the
model parameters. The correctness of our algorithm is shown in related work.

EXPECTATION-MAXIMIZATION($\phi$, $C$)

1. until convergence
2.     forall $z$ in $E$ [ $\gamma(z) := 0$; ]
3.     for $i = 1$ to $n$
4.         EXPECTATION-STEP($x^{T_i}, y^{V_i}, \phi, \gamma, 1$);
5.     MAXIMIZATION-STEP($\phi, \gamma$);

The $\gamma(z)$ variable accumulates the expected number of times that the edit
operation z was used to generate the string pairs in C. Convergence is achieved 
when the total probability of the training corpus does not change on consecutive 
iterations. In practice, we typically terminate the algorithm when the increase 
in the total probability of the training corpus falls below a fixed threshold. 
Alternately, we might simply perform a fixed number of iterations. 

Let us now consider the details of the algorithm, beginning with the
expectation step. First we define our forward and backward variables. The forward
variable $\alpha_{t,v}$ contains the probability $p(x^t, y^v|\phi)$ of generating the pair $\langle x^t, y^v \rangle$
of string prefixes. These values are calculated by the FORWARD-EVALUATE()
algorithm given in the preceding section.

The following BACKWARD-EVALUATE() algorithm calculates the backward
values. The backward variable $\beta_{t,v}$ contains the probability $p(x_{t+1}^T, y_{v+1}^V|\phi, \langle t, v \rangle)$
of generating the terminated suffix pair $\langle x_{t+1}^T, y_{v+1}^V \rangle$. Note that $\beta_{0,0}$ is equal to
$\alpha_{T,V}$.

BACKWARD-EVALUATE($x^T, y^V, \phi$)

1. $\beta_{T,V} := \delta(\#)$;
2. for $t = T \ldots 0$
3.     for $v = V \ldots 0$
4.         if ($v < V \lor t < T$) [ $\beta_{t,v} := 0$; ]
5.         if ($v < V$) [ $\beta_{t,v}$ += $\delta(\epsilon, y_{v+1})\,\beta_{t,v+1}$; ]
6.         if ($t < T$) [ $\beta_{t,v}$ += $\delta(x_{t+1}, \epsilon)\,\beta_{t+1,v}$; ]
7.         if ($v < V \land t < T$) [ $\beta_{t,v}$ += $\delta(x_{t+1}, y_{v+1})\,\beta_{t+1,v+1}$; ]
8. return($\beta$);
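
The claim that the backward value at the origin equals the forward value at the end of both strings is easy to check numerically. The sketch below assumes a dictionary representation of the transducer (empty string for $\epsilon$, key `"#"` for termination) and hypothetical probability values; the function names are illustrative.

```python
def forward_evaluate(x, y, delta):
    """Forward table; the final cell includes the termination probability."""
    T, V = len(x), len(y)
    alpha = [[0.0] * (V + 1) for _ in range(T + 1)]
    alpha[0][0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            if v > 0:
                alpha[t][v] += delta.get(("", y[v - 1]), 0.0) * alpha[t][v - 1]
            if t > 0:
                alpha[t][v] += delta.get((x[t - 1], ""), 0.0) * alpha[t - 1][v]
            if v > 0 and t > 0:
                alpha[t][v] += delta.get((x[t - 1], y[v - 1]), 0.0) * alpha[t - 1][v - 1]
    alpha[T][V] *= delta["#"]
    return alpha


def backward_evaluate(x, y, delta):
    """Backward table; beta[t][v] is the probability of the terminated suffix pair."""
    T, V = len(x), len(y)
    beta = [[0.0] * (V + 1) for _ in range(T + 1)]
    beta[T][V] = delta["#"]
    for t in range(T, -1, -1):
        for v in range(V, -1, -1):
            if v < V:  # insertion of y_{v+1} (0-based index v)
                beta[t][v] += delta.get(("", y[v]), 0.0) * beta[t][v + 1]
            if t < T:  # deletion of x_{t+1}
                beta[t][v] += delta.get((x[t], ""), 0.0) * beta[t + 1][v]
            if v < V and t < T:  # substitution of x_{t+1} by y_{v+1}
                beta[t][v] += delta.get((x[t], y[v]), 0.0) * beta[t + 1][v + 1]
    return beta


# Hypothetical edit probabilities over A = {a, b}, B = {c}; they sum to one.
delta = {("a", "c"): 0.2, ("b", "c"): 0.3, ("a", ""): 0.1, ("b", ""): 0.2,
         ("", "c"): 0.1, "#": 0.1}
alpha = forward_evaluate("abb", "cc", delta)
beta = backward_evaluate("abb", "cc", delta)
print(alpha[3][2], beta[0][0])
```

Both printed values are the probability of the whole pair, computed once from the left and once from the right.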

Recall that $\gamma(z)$ accumulates the expected number of times the edit
operation $z$ was used to generate a given string pair. These values are calculated
by the following EXPECTATION-STEP() algorithm, which assumes that the $\gamma$
accumulators have been properly initialized. The $\lambda$ argument weights the
expectation accumulation; it is used below when we learn a string classifier. For
the purposes of this section, $\lambda$ is always unity.

EXPECTATION-STEP($x^T, y^V, \phi, \gamma, \lambda$)

1. $\alpha$ := FORWARD-EVALUATE($x^T, y^V, \phi$);
2. $\beta$ := BACKWARD-EVALUATE($x^T, y^V, \phi$);
3. if ($\alpha_{T,V} = 0$) [ return; ]
4. $\gamma(\#)$ += $\lambda$;
5. for $t = 0 \ldots T$
6.     for $v = 0 \ldots V$
7.         if ($t > 0$) [ $\gamma(x_t, \epsilon)$ += $\lambda\,\alpha_{t-1,v}\,\delta(x_t, \epsilon)\,\beta_{t,v} / \alpha_{T,V}$; ]
8.         if ($v > 0$) [ $\gamma(\epsilon, y_v)$ += $\lambda\,\alpha_{t,v-1}\,\delta(\epsilon, y_v)\,\beta_{t,v} / \alpha_{T,V}$; ]
9.         if ($t > 0 \land v > 0$) [ $\gamma(x_t, y_v)$ += $\lambda\,\alpha_{t-1,v-1}\,\delta(x_t, y_v)\,\beta_{t,v} / \alpha_{T,V}$; ]

Recall that $\alpha_{T,V}$ and $\beta_{0,0}$ both contain $p(x^T, y^V|\phi)$ after lines 1 and 2,
respectively. Line 7 accumulates the posterior probability that we were in state
$\langle t-1, v \rangle$ and emitted a $\langle x_t, \epsilon \rangle$ deletion operation. Similarly, line 8 accumulates
the posterior probability that we were in state $\langle t, v-1 \rangle$ and emitted a $\langle \epsilon, y_v \rangle$
insertion operation. Line 9 accumulates the posterior probability that we were
in state $\langle t-1, v-1 \rangle$ and emitted a $\langle x_t, y_v \rangle$ substitution operation.
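
The expectation step can be sketched in Python as follows. This is an illustration under assumptions, not the report's implementation: the transducer is a dictionary from edit operations to probabilities (empty string for $\epsilon$, key `"#"` for termination), the accumulator is a `defaultdict`, and the probability values are hypothetical.

```python
from collections import defaultdict


def forward_evaluate(x, y, d):
    T, V = len(x), len(y)
    a = [[0.0] * (V + 1) for _ in range(T + 1)]
    a[0][0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            if v > 0:
                a[t][v] += d.get(("", y[v - 1]), 0.0) * a[t][v - 1]
            if t > 0:
                a[t][v] += d.get((x[t - 1], ""), 0.0) * a[t - 1][v]
            if v > 0 and t > 0:
                a[t][v] += d.get((x[t - 1], y[v - 1]), 0.0) * a[t - 1][v - 1]
    a[T][V] *= d["#"]
    return a


def backward_evaluate(x, y, d):
    T, V = len(x), len(y)
    b = [[0.0] * (V + 1) for _ in range(T + 1)]
    b[T][V] = d["#"]
    for t in range(T, -1, -1):
        for v in range(V, -1, -1):
            if v < V:
                b[t][v] += d.get(("", y[v]), 0.0) * b[t][v + 1]
            if t < T:
                b[t][v] += d.get((x[t], ""), 0.0) * b[t + 1][v]
            if v < V and t < T:
                b[t][v] += d.get((x[t], y[v]), 0.0) * b[t + 1][v + 1]
    return b


def expectation_step(x, y, delta, gamma, lam=1.0):
    """Add the expected edit-operation counts for (x, y), weighted by lam, to gamma."""
    T, V = len(x), len(y)
    a, b = forward_evaluate(x, y, delta), backward_evaluate(x, y, delta)
    if a[T][V] == 0.0:
        return  # the pair cannot be generated, so it contributes nothing
    gamma["#"] += lam
    for t in range(T + 1):
        for v in range(V + 1):
            if t > 0:  # deletion arriving at state (t, v)
                z = (x[t - 1], "")
                gamma[z] += lam * a[t - 1][v] * delta.get(z, 0.0) * b[t][v] / a[T][V]
            if v > 0:  # insertion arriving at state (t, v)
                z = ("", y[v - 1])
                gamma[z] += lam * a[t][v - 1] * delta.get(z, 0.0) * b[t][v] / a[T][V]
            if t > 0 and v > 0:  # substitution arriving at state (t, v)
                z = (x[t - 1], y[v - 1])
                gamma[z] += lam * a[t - 1][v - 1] * delta.get(z, 0.0) * b[t][v] / a[T][V]


# Hypothetical edit probabilities; they sum to one.
delta = {("a", "c"): 0.2, ("b", "c"): 0.3, ("a", ""): 0.1, ("b", ""): 0.2,
         ("", "c"): 0.1, "#": 0.1}
gamma = defaultdict(float)
expectation_step("abb", "cc", delta, gamma)
print(dict(gamma))
```

A useful sanity check is that every edit sequence for the pair consumes each symbol of the underlying string exactly once, so the expected counts of operations that consume an underlying symbol must sum to its length, and likewise for the surface string.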

Given the expectations $\gamma$ of our edit operations, the following
MAXIMIZATION-STEP() algorithm updates our model parameters $\phi$.

MAXIMIZATION-STEP($\phi, \gamma$)

1. $N := \gamma(\#)$;
2. forall $z$ in $E$ [ $N$ += $\gamma(z)$; ]
3. forall $z$ in $E$ [ $\delta(z) := \gamma(z)/N$; ]
4. $\delta(\#) := \gamma(\#)/N$;
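
The maximization step is a simple renormalization of the expected counts. A minimal sketch, assuming the accumulators are held in a dictionary that includes the termination symbol under the key `"#"`; the function name and the counts are hypothetical:

```python
def maximization_step(gamma):
    """Return edit probabilities proportional to the expected counts in gamma."""
    n = sum(gamma.values())  # N = gamma(#) + sum of gamma(z) over all edits z
    if n == 0.0:
        raise ValueError("no expectations were accumulated")
    return {z: g / n for z, g in gamma.items()}


# Hypothetical expected counts: one deletion, two substitutions, one termination.
gamma = {("a", ""): 1.0, ("b", "c"): 2.0, "#": 1.0}
delta = maximization_step(gamma)
print(delta)
```

Dividing by the total count guarantees that the updated parameters again sum to one over $E \cup \{\#\}$.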

The EXPECTATION-STEP() algorithm accumulates the expectations of edit
operations by considering all possible generation sequences. It is possible to
replace this algorithm with the VITERBI-EXPECTATION-STEP() algorithm, which
accumulates the expectations of edit operations by only considering the single
most likely generation sequence for a given pair of strings. The only change to
the EXPECTATION-STEP() algorithm would be to replace the subroutine calls in
lines 1 and 2. Although such a learning algorithm is arguably more appropriate
to the original string edit distance formulation, it is less suitable in our stochastic
model of string edit distance and so we do not pursue it here.

Convergence. The EXPECTATION-MAXIMIZATION() algorithm given above is
guaranteed to converge to a local maximum on a given corpus $C$, by a reduction
to finite growth models |2j| . Here we demonstrate that there may be multiple
local maxima, and that only one of these need be a global maximum.

Consider a transducer $\phi$ with alphabets $A = \{a, b\}$ and $B = \{c\}$ being
trained on a corpus $C$ consisting of exactly one string pair $\langle abb, cc \rangle$. We restrict
our attention to local maxima that are attainable without initializing any model
parameter to zero. Then, depending on how $\phi$ is initialized, EM may converge
to one of the following three local maxima.
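
    $\langle a,c \rangle$    $\langle b,c \rangle$    $\langle a,\epsilon \rangle$    $\langle b,\epsilon \rangle$    $-\log_2 p(C|\phi)$
    0        2/3      1/3      0        2.75
    1/3      1/3      0        1/3      3.75
    2/9      4/9      1/9      2/9      3.92

The global optimum is at $\delta(\langle a, \epsilon \rangle) = 1/3$ and $\delta(\langle b, c \rangle) = 2/3$, for which
$p(C|\phi) = 4/27$ (2.75 bits). This maximum corresponds to the optimal edit
sequence $\langle a, \epsilon \rangle \langle b, c \rangle \langle b, c \rangle$, that is, to left-insert $a$ and then perform two $\langle b, c \rangle$
substitutions.

A second local maximum is at $\delta(\langle a, c \rangle) = 1/3$, $\delta(\langle b, c \rangle) = 1/3$, and
$\delta(\langle b, \epsilon \rangle) = 1/3$, for which $p(C|\phi) = 2/27$ (3.75 bits). This maximum corresponds
to the following two edit sequences, each occurring with probability 1/27:

    $\langle a, c \rangle \langle b, c \rangle \langle b, \epsilon \rangle$
    $\langle a, c \rangle \langle b, \epsilon \rangle \langle b, c \rangle$

A third local maximum is at $\delta(\langle a, c \rangle) = 2/9$, $\delta(\langle b, c \rangle) = 4/9$, $\delta(\langle a, \epsilon \rangle) = 1/9$,
and $\delta(\langle b, \epsilon \rangle) = 2/9$, for which $p(C|\phi) = 16/243$ (3.92 bits). This maximum
corresponds to the following three edit sequences, each occurring with probability
16/729:

    $\langle a, \epsilon \rangle \langle b, c \rangle \langle b, c \rangle$
    $\langle a, c \rangle \langle b, c \rangle \langle b, \epsilon \rangle$
    $\langle a, c \rangle \langle b, \epsilon \rangle \langle b, c \rangle$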

Our experience suggests that such local maxima are not a limitation in 
practice, when the training corpus is sufficiently large. 
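
The three corpus probabilities quoted in this example can be checked with a few lines of Python. Two assumptions are made and should be read with care: the quoted figures (4/27, 2/27, 16/243) only come out if no termination factor is applied, so the sketch omits it; and the second parameter setting is taken to place its remaining 1/3 on the deletion of b, as the edit sequences listed for it require. The function name and dictionary representation are illustrative.

```python
import math


def pair_probability(x, y, delta):
    """Sum over all edit sequences that yield (x, y); no termination factor (assumed)."""
    T, V = len(x), len(y)
    a = [[0.0] * (V + 1) for _ in range(T + 1)]
    a[0][0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            if v > 0:
                a[t][v] += delta.get(("", y[v - 1]), 0.0) * a[t][v - 1]
            if t > 0:
                a[t][v] += delta.get((x[t - 1], ""), 0.0) * a[t - 1][v]
            if v > 0 and t > 0:
                a[t][v] += delta.get((x[t - 1], y[v - 1]), 0.0) * a[t - 1][v - 1]
    return a[T][V]


candidates = {
    "global": {("a", ""): 1 / 3, ("b", "c"): 2 / 3},
    "second": {("a", "c"): 1 / 3, ("b", "c"): 1 / 3, ("b", ""): 1 / 3},
    "third": {("a", "c"): 2 / 9, ("b", "c"): 4 / 9, ("a", ""): 1 / 9, ("b", ""): 2 / 9},
}
results = {name: pair_probability("abb", "cc", d) for name, d in candidates.items()}
for name, p in results.items():
    print(name, p, -math.log2(p))
```

Each setting is also a fixed point of the reestimation: the expected counts under the posterior over its edit sequences, once normalized, reproduce the same parameter values.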

2.4 Three Variants 

Here we briefly consider three variants of the memoryless stochastic transducer. 
First, we explain how to reduce the number of free parameters in the transducer, 
and thereby simplify the corresponding edit cost function. Next, we propose a 
way to combine different transduction distances using the technique of finite 
mixture modeling. Finally, we suggest an even stronger class of string distances 
that are based on stochastic transducers with memory. A fourth variant - the
generalization to $k$-way transduction - appears in related work Jm|, .

2.4.1 Parameter Tying 

In many applications, the edit cost function is simpler than the one that we
have been considering here. The most widely used edit distance has only four
distinct costs: the insertion cost, the deletion cost, the identity cost, and the
substitution cost.² Although this simplification may result in a weaker edit
distance, it has the advantage of requiring less training data to accurately learn
the edit costs. In the statistical modeling literature, the use of such parameter
equivalence classes is dubbed parameter tying.

² Bunke and Csirik |J propose an even weaker "parametric edit distance" whose only free
parameter is a single substitution cost $r$. The insertion and deletion costs are fixed to unity
while the identity cost is zero.

It is straightforward to implement arbitrary parameter tying for memoryless
stochastic transducers. Let $\tau(z)$ be the equivalence class of the edit operation
$z$, $\tau(z) \in 2^E$, and let $\delta(\tau(z)) = \sum_{z' \in \tau(z)} \delta(z')$ be the total probability assigned
to the equivalence class $\tau(z)$. After maximization, we simply set $\delta(z)$ to be
uniform within the total probability $\delta(\tau(z))$ assigned to $\tau(z)$.

$$\delta(z) := \delta(\tau(z)) / |\tau(z)|$$
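
Parameter tying amounts to averaging the probabilities within each equivalence class after the maximization step. The sketch below is illustrative: the function name `tie_parameters`, the four-class partition, and the probability values are assumptions chosen to mirror the four-cost edit distance mentioned above.

```python
def tie_parameters(delta, classes):
    """Spread each class's total probability uniformly over its members."""
    tied = dict(delta)  # operations outside every class are left unchanged
    for members in classes:
        total = sum(delta[z] for z in members)
        for z in members:
            tied[z] = total / len(members)
    return tied


# Hypothetical untied probabilities over A = B = {a, b}; they sum to one.
delta = {("a", "a"): 0.30, ("b", "b"): 0.10, ("a", "b"): 0.05, ("b", "a"): 0.15,
         ("a", ""): 0.10, ("b", ""): 0.02, ("", "a"): 0.03, ("", "b"): 0.05, "#": 0.20}
classes = [
    [("a", "a"), ("b", "b")],  # identity
    [("a", "b"), ("b", "a")],  # substitution
    [("a", ""), ("b", "")],    # deletion
    [("", "a"), ("", "b")],    # insertion
]
tied = tie_parameters(delta, classes)
print(tied)
```

Because each class keeps its total probability, the tied parameters still sum to one, and the number of free parameters drops from eight to four in this example.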

2.4.2 Finite Mixtures 

A $k$-component mixture transducer $\phi = (A, B, \mu, \delta)$ is a linear combination of
$k$ memoryless transducers defined on the same alphabets $A$ and $B$. The mixing
parameters $\mu$ form a probability function, where $\mu_i$ is the probability of choosing
the $i$th memoryless transducer. Therefore, the total probability assigned to a
pair of strings by a mixture transducer is a weighted sum over all the component
transducers.

$$p(x^t, y^v|\phi) = \sum_{i=1}^{k} p(x^t, y^v|(A, B, \delta_i)) \, \mu_i$$

A mixture transducer combines the predictions of its component transducers in
a surprisingly effective way. Since the cost $-\log \mu_i$ of selecting the $i$th
component of a mixture transducer is insignificant when compared to the total cost
$-\log p(x^t, y^v|\phi_i)$ of the string pair according to the $i$th component, the string
distance defined by a mixture transducer is effectively the minimum over the $k$
distances defined by its $k$ component transducers.
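
The "effectively the minimum" behaviour can be made concrete with a small numeric sketch. The component probabilities below are hypothetical values standing in for what two trained transducers might assign to the same string pair, and the function name is illustrative.

```python
import math


def mixture_distance(component_probs, mu):
    """Negative log of the mixture probability of one string pair (in nats)."""
    return -math.log(sum(m * p for m, p in zip(mu, component_probs)))


# Hypothetical probabilities assigned to one string pair by two components.
component_probs = [1e-12, 1e-20]
mu = [0.5, 0.5]  # uniform mixing parameters

component_distances = [-math.log(p) for p in component_probs]
d_mix = mixture_distance(component_probs, mu)
print(component_distances, d_mix)
```

With uniform mixing over $k$ components, the mixture distance always lies between the smallest component distance and that distance plus $\log k$, so the best component dominates whenever the component distances differ by much more than $\log k$.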

Choosing the components of a mixture transducer is more of an art than a
science. One effective approach is to combine simpler models with more
complex models. We would combine transducers with varying degrees of parameter
tying, all trained on the same corpus. The mixing parameters could be
uniform, i.e., $\mu_i = 1/k$, or they could be optimized using withheld training data
(cross-estimation).

Another effective approach is to combine models trained on different corpora. 
This makes the most sense if the training corpus consists of naturally distinct 
sections. In this setting, we would train a different transducer on each section 
of the corpus, and then combine the resulting transducers into a mixture model. 
The mixing parameters could be set to the relative sizes of the corpus sections, 
or they could be optimized using withheld training data. For good measure, we 
could also include a transducer that was trained on the entire corpus. 

2.4.3 Memory

From a statistical perspective, the memoryless transducer is quite weak because
consecutive edit operations are independent. A more powerful model - the
stochastic transducer with memory - would condition the probability $\delta(z_t|z_{t-n}^{t-1})$
of generating an edit operation $z_t$ on a finite suffix of the edit sequence that
has already been generated. Alternately, we might condition the probability
of an edit operation $z_t$ on (a finite suffix of) the yield $\nu(z^{t-1})$ of the past
edit sequence. These stochastic transducers can be further strengthened with
state-conditional interpolation [12, 23] or by conditioning our edit probabilities
$\delta(z_t|z_{t-n}^{t-1}, s)$ on a hidden state $s$ drawn from a finite state space. The details
of this approach, which is strictly more powerful than the class of transducers
considered by Bahl and Jelinek Q , are presented in forthcoming work.

3 String Classification 

In the preceding section, we presented an algorithm to automatically learn a 
string edit distance from a corpus of similar string pairs. Unfortunately, this 
algorithm cannot be directly applied to solve string classification problems. In 
a string classification problem, we are asked to assign strings to a finite number
of classes. To learn a string classifier, we are presented with a corpus of labeled 
strings, not pairs of similar strings. Here we present a stochastic solution to 
the string classification problem that allows us to automatically and efficiently 
learn a powerful string classifier from a corpus of labeled strings. Our approach 
is the stochastic analog of nearest-neighbor techniques. 

For string classification problems, we require a conditional probability $p(w|y^v)$
that the string $y^v$ belongs to the class $w$. This conditional may be obtained
from the joint probability $p(w, y^v)$ by a straightforward application of Bayes'
rule: $p(w|y^v) = p(w, y^v)/p(y^v)$. In this section, we explain how to automatically
induce a strong joint probability model $p(w, y^v|L, \phi)$ from a corpus of labeled
strings, and how to use this model to optimally classify unseen strings.

We begin by defining our model class in section 3.1. In section 3.2 we explain
how to use our stochastic model to optimally classify unseen strings. Section 3.3
explains how to estimate the model parameters from a corpus of labeled strings.

3.1 Hidden Prototype Model

We model the joint probability $p(w, y^v)$ as the marginal of the joint probability
$p(w, x^t, y^v)$ of a class $w$, an underlying prototype $x^t$, and an observed string $y^v$

$$p(w, y^v) = \sum_{x^t \in A^*} p(w, x^t, y^v).$$

The prototype strings are drawn from the alphabet $A$ while the observed strings
are drawn from the alphabet $B$. Next, we model the joint probability $p(w, x^t, y^v)$
as a product of conditional probabilities,

$$p(w, x^t, y^v|\phi, L) = p(w|x^t, L) \, p(x^t, y^v|\phi) \qquad (5)$$

where the joint probability $p(x^t, y^v|\phi)$ of a prototype $x^t$ and a string $y^v$ is
determined by a stochastic transducer $\phi$, and the conditional probability $p(w|x^t, L)$
of a class $w$ given a prototype $x^t$ is determined from the probabilities $p(w, x^t|L)$
of the labeled prototypes $\langle w, x^t \rangle$ in the prototype dictionary $L$. This model has
only $O(|L| + |A \times B|)$ free parameters: $|L| - 1$ free parameters in the lexicon
model $p(w, x^t|L)$ plus $(|A| + 1) \cdot (|B| + 1) - 1$ free parameters in the transducer
$\phi$ over the alphabets $A$ and $B$.

We considered the alternate factorization $p(w, x^t, y^v|\phi, L) = p(y^v|x^t, \phi) \, p(w, x^t|L)$
but rejected it as being inconsistent with the main thrust of our paper, which
is the automatic acquisition and use of joint probabilities on string pairs. We
note, however, that this alternate factorization has a more natural generative
interpretation as a giant finite mixture model with $|L|$ components whose
mixing parameters are the probabilities $p(w, x^t|L)$ of the labeled prototypes and
whose component models are the conditional probabilities $p(y^v|x^t, \phi)$ given by
the transducer in conjunction with the underlying form $x^t$. This alternate
factorization suggests a number of extensions to the model, such as the use of
class-conditional transducers $p(y^v|x^t, \phi_w)$ and intra-class parameter tying schemes.

3.2 Optimal Classifier 

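The conditional probability $p(w|y^v)$, in conjunction with an application-specific
utility function $\mu : W \times W \to \mathbb{R}$, defines a classifier

$$\hat{w} = \arg\max_{u} \sum_{w} \mu(u|w) \, p(w|y^v)$$

that maximizes the expected utility of the classification, where $\mu(u|w)$ is the
utility of returning the class $u$ when we believe that the true class is $w$.
For each string $y^v$, the minimum error rate classifier outputs $\hat{w}$

$$\begin{array}{rcl}
\hat{w} & = & \arg\max_w \; p(w|y^v, \phi, L) \\
 & = & \arg\max_w \; p(w, y^v|\phi, L) \\
 & = & \arg\max_w \; \sum_{x^t \in A^*} p(w, x^t, y^v|\phi, L) \\
 & = & \arg\max_w \; \sum_{x^t \in L(w)} p(w|x^t, L) \, p(x^t, y^v|\phi)
\end{array}$$

where $L(w)$ is the set of prototype strings for the class $w$. The last line follows
from the factorization (5), since only the prototypes of $w$ contribute to the sum.
This decision rule correctly aggregates the similarity between an observed string
and all prototypes for a given class.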
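
The minimum error rate decision rule can be sketched as follows. Everything concrete here is an assumption made for illustration: the lexicon is a dictionary from (class, prototype) pairs to probabilities, the transducer is a dictionary of edit probabilities with `"#"` as the termination key, and the toy words, prototypes, and numbers are invented.

```python
def pair_probability(x, y, delta):
    """Forward evaluation of p(x, y | phi), including the termination probability."""
    T, V = len(x), len(y)
    a = [[0.0] * (V + 1) for _ in range(T + 1)]
    a[0][0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            if v > 0:
                a[t][v] += delta.get(("", y[v - 1]), 0.0) * a[t][v - 1]
            if t > 0:
                a[t][v] += delta.get((x[t - 1], ""), 0.0) * a[t - 1][v]
            if v > 0 and t > 0:
                a[t][v] += delta.get((x[t - 1], y[v - 1]), 0.0) * a[t - 1][v - 1]
    return a[T][V] * delta["#"]


def classify(y, lexicon, delta):
    """Return the class w maximizing the sum over its prototypes of p(w|x,L) p(x,y|phi)."""
    prototype_total = {}  # p(x | L), the marginal over classes
    for (w, x), p in lexicon.items():
        prototype_total[x] = prototype_total.get(x, 0.0) + p
    scores = {}
    for (w, x), p in lexicon.items():
        p_w_given_x = p / prototype_total[x]
        scores[w] = scores.get(w, 0.0) + p_w_given_x * pair_probability(x, y, delta)
    return max(scores, key=scores.get), scores


# Hypothetical transducer over the alphabet {k, a, o, t}; probabilities sum to one.
alphabet = "kaot"
delta = {"#": 0.12}
for a_sym in alphabet:
    delta[(a_sym, "")] = 0.02
    delta[("", a_sym)] = 0.02
    for b_sym in alphabet:
        delta[(a_sym, b_sym)] = 0.15 if a_sym == b_sym else 0.01

# Hypothetical lexicon of labeled prototypes (class, prototype) -> probability.
lexicon = {("cat", "kat"): 0.5, ("cot", "kot"): 0.5}
print(classify("kat", lexicon, delta)[0])
```

Summing over all prototypes of a class, rather than taking only the closest one, is what distinguishes this rule from a nearest-neighbor classifier.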

3.3 Estimation

Given a prototype lexicon $L : W \times A^* \to [0, 1]$ and a corpus
$C = \langle w_1, y^{V_1} \rangle, \ldots, \langle w_n, y^{V_n} \rangle$
of labeled strings, we estimate the parameters of our model (5) using expectation
maximization for finite mixture models. If the prototype dictionary is not
provided, one may be constructed from the training corpus. Our EM algorithm
will maximize the joint probability of the corpus.

MIXTURE-EXPECTATION-MAXIMIZATION($\phi$, $L$, $C$)

1. until convergence
2.     forall $z$ in $E$ [ $\gamma(z) := 0$; ]
3.     forall $\langle w, x^T \rangle$ in $L$ [ $\gamma(w, x^T) := 0$; ]
4.     for $i = 1$ to $n$
5.         MIXTURE-EXPECTATION-STEP($w_i, y^{V_i}, \phi, L, \gamma$);
6.     MIXTURE-MAXIMIZATION-STEP($\phi, L, \gamma$);

Lines 2-3 initialize the $\gamma$ expectation accumulators. In practice, it is
advisable to add a small constant to the $\gamma$ accumulators so that no probability is
optimized to zero.³ Lines 4-5 take an expectation step on every labeled string
in the training corpus. Each expectation step increments the $\gamma$ accumulators,
unless $p(w_i, y^{V_i}|\phi, L)$ is zero. Finally, line 6 updates the model parameters in $\phi$
and $L$ based on the accumulated expectations in $\gamma$.

The heart of the EM algorithm is the MIXTURE-EXPECTATION-STEP()
procedure.

MIXTURE-EXPECTATION-STEP($w, y^V, \phi, L, \gamma$)

1. $Z := 0$;
2. forall $x^T$ in $L(w)$
3.     $a(x^T) := L(w, x^T)/L(x^T)$;
4.     $a(x^T)$ *= FORWARD-EVALUATE($x^T, y^V, \phi$);
5.     $Z$ += $a(x^T)$;
6. forall $x^T$ in $L(w)$
7.     $\gamma(w, x^T)$ += $a(x^T)/Z$;
8.     EXPECTATION-STEP($x^T, y^V, \phi, \gamma, a(x^T)/Z$);

Lines 1-5 accumulate the posterior probabilities p(x T \w, y v , cf>, L) for all pro- 
totypes x T G L(w). p(x T \w,y v ,4>,L) is the probability that the labeled proto- 
type (w,x T ) generated the observed string y v with known label w. 

1 t 1 v * t\ p(w,x T ,y v \(j),L) 
p{x \w,y ,<p,L)- 



J2 x -r eL{w) p(w,x T ,y v \(l),L) 



3 In our experiments below, we initialize -f(z) to because we have sufficient training data 
for the transducer. ■y(w,x T ) is initialized to 0.1 because our prototype dictionary is at least 
as large as our training corpus. 
Line 3 computes $p(w \mid x^T, L)$ from $p(w, x^T \mid L) / p(x^T \mid L)$ while line 4
multiplies in $p(x^T, y^V \mid \phi)$. At the end of the first loop, $Z$ holds the
marginal $p(w, y^V \mid \phi, L)$. The second loop accumulates expectations for $L$
and $\phi$. Line 7 accumulates expectations for the labeled prototypes $(w, x^T)$
in $L$, in order to reestimate the $p(w, x^T \mid L)$ parameters of our lexicon.
Line 8 takes a weighted expectation step for the transducer $\phi$ on the string
pair $(x^T, y^V)$. The weight $a(x^T)/Z$ is the posterior probability
$p(x^T \mid w, y^V, \phi, L)$. As a result, this learning algorithm only trains
the transducer on similar strings.
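
The weighting in lines 1-7 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the names `prototype_posteriors`, `lexicon_joint`, and `pair_prob` are hypothetical, `pair_prob` stands in for FORWARD-EVALUATE, and the lexicon is assumed to be stored as a dictionary from (word, form) pairs to joint probabilities.

```python
from typing import Callable, Dict, Tuple

Entry = Tuple[str, str]  # (word, underlying form); a layout assumed for this sketch


def prototype_posteriors(
    w: str,
    y: str,
    lexicon_joint: Dict[Entry, float],
    pair_prob: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return the posterior weight a(x)/Z of every prototype x listed under w."""
    # p(x | L): marginal of the lexicon joint p(w, x | L) over words.
    form_marginal: Dict[str, float] = {}
    for (_, x), p in lexicon_joint.items():
        form_marginal[x] = form_marginal.get(x, 0.0) + p

    weights: Dict[str, float] = {}
    z = 0.0
    for (word, x), p in lexicon_joint.items():
        if word != w:
            continue
        a = p / form_marginal[x]   # line 3: p(w | x, L)
        a *= pair_prob(x, y)       # line 4: multiply by p(x, y | phi)
        weights[x] = a
        z += a                     # line 5: accumulate the marginal
    if z == 0.0:
        return {}                  # zero probability: nothing is accumulated
    return {x: a / z for x, a in weights.items()}
```

With these weights in hand, a prototype that the transducer considers unlikely to have produced the observed string contributes almost nothing to the transducer's expectations.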

All that remains is to provide the MIXTURE-MAXIMIZATION-STEP() algorithm,
which is straightforward.

MIXTURE-MAXIMIZATION-STEP($\phi$, $L$, $\gamma$)

1. $N := 0$;

2. forall $(w, x^t)$ in $L$ [ $N$ += $\gamma(w, x^t)$; ]

3. forall $(w, x^t)$ in $L$ [ $L(w, x^t) := \gamma(w, x^t)/N$; ]

4. MAXIMIZATION-STEP($\phi$, $\gamma$);
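
Lines 1-3 amount to normalizing the lexicon accumulators. A minimal Python sketch follows, assuming the accumulators are held in a dictionary keyed by (word, form) pairs; the function name `reestimate_lexicon` is hypothetical, and the transducer update of line 4 is not shown.

```python
from typing import Dict, Tuple

Entry = Tuple[str, str]  # (word, underlying form); a layout assumed for this sketch


def reestimate_lexicon(gamma: Dict[Entry, float]) -> Dict[Entry, float]:
    """Normalize expectation accumulators into joint probabilities p(w, x | L)."""
    n = sum(gamma.values())                               # lines 1-2
    if n <= 0.0:
        raise ValueError("accumulators must have positive total mass")
    return {entry: g / n for entry, g in gamma.items()}   # line 3
```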

Note that maximizing the joint probability $p(w, y^V \mid \phi, L)$ is not the same as
maximizing the conditional probability $p(w \mid y^V, \phi, L)$. The algorithms
presented here maximize the joint probability, although they may be
straightforwardly adapted to the latter objective. Unfortunately, neither
objective is the same as minimizing the error rate, although they are closely
related in practice.

Our approach to string classification has the additional virtue of being able 
to learn a new class from only a single example of that class, without any 
retraining. In the case of the pronunciation recognition problem considered 
below, we can learn to recognize the pronunciations of new words from only a 
single example of the new word's pronunciation. This possibility is suggested by 
the superior performance of our techniques in experiment E3, where the lexicon
is constructed from training examples only, without any human intervention. 

In appendix A we consider another approach to the string classification problem
based on the classic "nearest neighbor" decision rule. In this ad-hoc approach,
we learn a string edit distance using all valid pairs $(x^t, y^{V_i})$ of underlying
forms $x^t \in L(w_i)$ and surface realizations $y^{V_i}$ for each word $w_i$ in the
training corpus. For each phonetic string $y^{S_j}$ in the testing corpus $C'$, we
return the word $v_j$ in $D$ that minimizes the string distance $d(x^t, y^{S_j})$
among all lexical entries $(v, x^t) \in L$. Although this approach is technically
simple, it has the unfortunate property of training the transduction distances
on both similar and dissimilar pairs of strings. Consequently, the performance
of the transduction distances trained using this approach is not appreciably
different from the performance of the untrained Levenshtein distance.
Experimental results obtained using this ad-hoc approach are also included in
the appendix.
4 An Application



In this section, we apply our techniques to the problem of learning the pronun- 
ciations of words. A given word of a natural language may be pronounced in 
many different ways, depending on such factors as the dialect, the speaker, and 
the linguistic environment. We describe one way of modeling variation in the 
pronunciation of words. Let $W$ be the set of syntactic words in a language, let
$A$ be the set of underlying phonological segments employed by the language, and
let $B$ be the set of observed phonemes. The pronouncing lexicon
$L : W \to 2^{A^*}$ assigns a small set of underlying phonological forms to every
syntactic word in the language. Each underlying form in $A^*$ is then mapped to
a surface form in $B^*$ by a stochastic process. Our goal is to recognize
phonetic strings, which will require us to map each surface form to the
syntactic word for which it is a pronunciation.

We formalize this pronunciation recognition (PR) problem as follows. The
input to Pronunciation Recognition is a six-tuple $(W, A, B, L, C, C')$ consisting
of a set $W$ of syntactic words, an alphabet $A$ of phonological segments, an
alphabet $B$ of phonetic segments, a pronouncing lexicon $L : W \to 2^{A^*}$, a
training corpus $C = (w_1, y^{V_1}), \ldots, (w_n, y^{V_n})$ of labeled phonetic
strings, and a testing corpus $C' = y^{S_1}, \ldots, y^{S_m}$ of unlabeled phonetic
strings. Each training pair $(w_i, y^{V_i})$ in $C$ includes a syntactic word
$w_i \in W$ along with a phonetic string $y^{V_i} \in B^{V_i}$. The output is a set
of labels $v_1, \ldots, v_m$ for the phonetic strings in the testing corpus $C'$.

The pronunciation recognition problem may be reduced to the string classi- 
fication problem: the syntactic words are the classes, the underlying forms are 
the prototype strings, and the surface forms are the surface strings in need of 
classification. So let us now apply our stochastic solution to the Switchboard 
corpus of conversational speech. 

4.1 Switchboard Corpus 

The Switchboard corpus contains over 3 million words of spontaneous telephone
speech conversations. It is considered one of the most difficult corpora for
speech recognition (and pronunciation recognition) because of the tremendous 
variability of spontaneous speech. As of Summer 1996, speech recognition tech- 
nology has a word error rate above 45% on the Switchboard corpus. The same 
speech recognition technology achieves a word error rate of less than 5% on read 
speech. 

Over 200,000 words of Switchboard have been manually assigned phonetic
transcripts at ICSI using a proprietary phonetic alphabet. The Switchboard
corpus also includes a pronouncing lexicon with 71,100 entries using a modified
Pronlex phonetic alphabet (long form). In order to make the pronouncing
lexicon compatible with the ICSI corpus of phonetic transcripts, we removed
148 entries from the lexicon and 73,068 samples from the ICSI corpus (see
footnote 4). After filtering, our pronouncing lexicon had 70,952 entries for
66,284 syntactic words over an alphabet of 42 phonemes. Our corpus had 214,310
samples, of which 23,955 were distinct, for 9,015 syntactic words with 43
phonemes (42 Pronlex phonemes plus a special "silence" symbol).

4.2 Four Experiments 

We conducted four sets of experiments using seven models. In all cases, we
partitioned our corpus of 214,310 samples 9:1 into 192,879 training samples and
21,431 test samples. In no experiment did we adapt our probability model
to the test data.

Our seven models consist of Levenshtein distance as well as six variants
resulting from our two interpretations of three models (see footnote 5). Our two
interpretations are the stochastic edit distance and the classic edit distance,
also called the Viterbi edit distance. For each interpretation, we built a tied
model with only four parameters, an untied model, and a mixture model
consisting of a uniform mixture of the tied and untied models.

The transducer parameters are initialized uniformly before training, as are
the parameters of the word model $p(w \mid L)$ and the conditional lexicon model
$p(x^t \mid w, L)$ for all entries $(w, x^t)$ in $L$. Note that a uniform
$p(w \mid L)$ and a uniform $p(x^t \mid w, L)$ are not equivalent to a uniform
$p(w, x^t \mid L)$ because more frequent words tend to have more pronunciations
in the lexicon.
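
A toy example makes the distinction concrete. The Python sketch below is illustrative only: the three-entry lexicon and the function name `joint_from_uniform_conditionals` are invented for this example and do not come from the Switchboard data.

```python
from collections import defaultdict
from fractions import Fraction


def joint_from_uniform_conditionals(lexicon):
    """p(w, x | L) = p(w | L) * p(x | w, L) with both factors uniform."""
    forms = defaultdict(list)
    for w, x in lexicon:
        forms[w].append(x)
    p_word = Fraction(1, len(forms))            # uniform over distinct words
    return {
        (w, x): p_word * Fraction(1, len(xs))   # uniform over forms of w
        for w, xs in forms.items()
        for x in xs
    }


# Hypothetical lexicon: one word with two forms, one word with a single form.
toy_lexicon = [("the", "dh ah"), ("the", "dh iy"), ("cat", "k ae t")]
joint = joint_from_uniform_conditionals(toy_lexicon)
```

Here each entry for "the" receives probability 1/4 while the single entry for "cat" receives 1/2, whereas a uniform joint would give every entry 1/3.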

Our four sets of experiments are determined by how we obtain our pronouncing
lexicon. The first two experiments use the Switchboard pronouncing lexicon.
Experiment E1 uses the full pronouncing lexicon for all 66,284 words while
experiment E2 uses the subset of the pronouncing lexicon for the 9,015 words in
the corpus. The second two experiments use a lexicon derived from the corpus.
Experiment E3 uses the training corpus only to construct the pronouncing
lexicon, while experiment E4 uses the entire corpus, both training and testing
portions, to construct the pronouncing lexicon. The test corpus has 512 samples
whose words did not appear in the training corpus, which lower bounds the error
rate for experiment E3 to 2.4%.

The principal difference among these four experiments is how much informa- 
tion the training corpus provides about the test corpus. In order of increasing 

4 From the lexicon, we removed 148 entries whose words had unusual punctuation ([<! .]).
From the ICSI corpus, we removed 72,257 samples that were labeled with silence, 688 samples 
with an empty phonetic transcript, 88 samples with a fragmentary transcript due to interrup- 
tions, 27 samples with the undocumented symbol ?, and 8 samples with the undocumented 
symbol ! . Note that the symbols ? and ! are not part of either the ICSI phonetic alphabet 
or the Pronlex phonetic alphabet (long forms), and are only used in the ICSI corpus. 

5 The Levenshtein distance is the minimum number of insertions, deletions, and substitu- 
tions required to transform one string into another. Thus, the Levenshtein distance is a string 
edit distance where the cost of all identity substitutions is zero and all other edit costs are 
unity. 
information, we have E3 < E1 < E2 < E4. In experiment E3, the pronouncing
lexicon is constructed from the training corpus only and therefore E3 provides
no direct information about the test corpus. In experiment E1, the pronouncing
lexicon was constructed from the entire 3 million word Switchboard corpus, and
therefore E1 provides weak knowledge of the set of syntactic words that appear
in the test corpus. In experiment E2, the pruned pronouncing lexicon provides
stronger knowledge of the set of syntactic words that actually appear in the
test corpus, as well as their most salient phonetic forms. In experiment E4, the
pronouncing lexicon provides complete knowledge of the set of syntactic words
paired with their actual phonetic forms in the test corpus.

The following table presents the essential characteristics of the lexicons used 
in the four experiments. 











      entries   words    forms    entries/word   novel samples   entries/sample
E1    70,952    66,284   64,937   1.070          2908            1.895
E2     9,621     9,015    9,343   1.067          3261            1.267
E3    22,140     8,570   17,880   2.583          1773            9.434
E4    23,955     9,015   19,355   2.657             0            10.027



The first four fields of the table pertain to the lexicon alone. 'Entries' is
the number of entries in the lexicon, 'words' is the number of unique words
in the lexicon, 'forms' is the number of unique phonetic forms in the lexicon,
and 'entries/word' is the mean number of entries per word. The final two fields
characterize the relation between the lexicon and the test corpus. 'Novel
samples' is the number of samples in the test corpus whose phonetic forms do not
appear in the lexicon, and 'entries/sample' is the mean number of lexical
entries that exactly match the phonetic form of a sample in the test corpus.

For each experiment, we report the fraction of misclassified samples in the
testing corpus (i.e., the word error rate). Note that the pronouncing lexicons
have many homophones. Our decision rule $d : B^* \to 2^L$ maps each test sample
$y^{S_i}$ to a subset $d(y^{S_i}) \subseteq L$ of the lexical entries. Accordingly,
we calculate the fraction of correctly classified samples as the sum over all
test samples of the ratio of the number of correct lexical entries in
$d(y^{S_i})$ to the total number of postulated lexical entries in $d(y^{S_i})$.
The fraction of misclassified samples is one minus the fraction of correctly
classified samples.
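
The scoring rule can be sketched in a few lines of Python. This is a hedged illustration rather than the evaluation code used for the experiments: the function name `word_error_rate` is hypothetical, entries are assumed to be (word, form) pairs, and the sum of per-sample ratios is assumed to be divided by the number of test samples so that the result is a fraction.

```python
from typing import Iterable, Set, Tuple

Entry = Tuple[str, str]  # (word, underlying form); a layout assumed for this sketch


def word_error_rate(decisions: Iterable[Set[Entry]], truths: Iterable[str]) -> float:
    """One minus the mean share of postulated entries whose word is correct."""
    credit = 0.0
    count = 0
    for postulated, true_word in zip(decisions, truths):
        count += 1
        if postulated:
            correct = sum(1 for word, _ in postulated if word == true_word)
            credit += correct / len(postulated)   # partial credit for homophones
    if count == 0:
        raise ValueError("no test samples")
    return 1.0 - credit / count
```

For example, a sample whose decision set contains two homophonous entries, only one of which carries the correct word, earns half credit.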

4.3 Results 

Our experimental results are summarized in the following table and figures. The 
table shows the word error rate for each model at the tenth EM iteration. After 
training, the error rates of the transduction distances are from one half to one 
sixth the error rate of the untrained Levenshtein distance. The stochastic and 
Viterbi edit distances have comparable performance. The untied and mixed 
models perform better than the tied model in experiments E1, E2, and E3.
      Leven-    Stochastic Distance       Viterbi Distance
      shtein    Tied    Untied  Mixed     Tied    Untied  Mixed
E1    48.04     20.87   18.61   18.74     20.87   18.58   18.73
E2    33.00     19.56   17.14   17.35     19.63   17.16   17.35
E3    61.87     14.60   14.29   14.28     14.58   14.29   14.28
E4    56.35      9.34    9.36    9.36      9.34    9.36    9.36



The error rate for experiment E3 is bounded below by 2.4% because the test
corpus contains 512 out-of-vocabulary samples in the E3 experiment. If we
discard these samples, then the E3 error rate for the untied model would drop
from 14.29% to 12.19%.
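
Both figures follow from the corpus sizes reported in section 4.2. The short Python check below reproduces the arithmetic; the variable names are ours, and the only inputs are numbers quoted in the text.

```python
test_samples = 21_431    # size of the test corpus (section 4.2)
oov_samples = 512        # test samples whose word is absent from training
untied_error = 0.1429    # E3 error rate of the untied model

# Every out-of-vocabulary sample is necessarily misclassified in E3.
lower_bound = oov_samples / test_samples

# Remove those samples from both the error count and the sample count.
errors = untied_error * test_samples
adjusted = (errors - oov_samples) / (test_samples - oov_samples)

print(f"lower bound: {lower_bound:.1%}")   # prints "lower bound: 2.4%"
print(f"adjusted:    {adjusted:.2%}")      # prints "adjusted:    12.19%"
```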

A sparser lexicon entails a more complex mapping between underlying forms
and surface forms. The E3 and E4 lexicons have 2.6 entries per word, while the
E1 and E2 lexicons have only 1.1 entries per word. Consequently, the inferior
performance of the transducer in E1 and E2 relative to E3 and E4 is best
explained by the statistical weakness of a transducer without memory. The E1
lexicon has entries for 66,284 words while the E2 lexicon has entries only for
the 9,015 words that appear in the corpus. As a result, a significant amount of
the $p(w, x^t \mid L)$ probability mass is assigned to words that do not appear in
either the training or testing data in experiment E1. This accounts for the
relative performance of the transducer in E1 and E2.

In experiment E4, the lexicon contains an entry for every sample in the 
test corpus. Since the Levenshtein distance between a surface form (in the test 
corpus) and an underlying form (in the lexicon) is minimized when the two forms 
are identical, we might expect the Levenshtein distance to achieve a perfect 0% 
error rate in experiment E4, instead of its actual 56.35% error rate. The poor 
performance of the Levenshtein distance in experiment E4 is due to the fact 
that the mapping from phonetic forms to syntactic words is many-to-many in 
the E4 lexicon. Each phonetic form in the test corpus appears in 10.027 entries 
in the E4 lexicon, on average. The most ambiguous phonetic form in the test 
corpus, "ah", appears 528 times in the test corpus and exactly matches entries
for the following 62 words in the E4 lexicon.

a a_ all an and are at by bye don't for gaw have her high
hm huh I I'll I'm I've I_ in it know little my no of oh old
on or other ought our out pay see so that the them then there
they those though to too uh uhhuh um up us was we've what
who would yeah you

The great ambiguity of "ah" is due to transcription errors, segmentation 
errors, and the tremendous variability of spontaneous conversational speech. 

We believe that the superior performance of our statistical techniques in
experiment E3, when compared to experiments E1 and E2, has two significant
implications. Firstly, it raises the possibility of obsoleting the costly
process of making a pronouncing lexicon by hand. A pronouncing lexicon that is
constructed directly from actual pronunciations offers the possibility of
better performance than one constructed in traditional ways. Secondly, it
suggests that our techniques may be able to accurately recognize the
pronunciations of new words from only a single example of the new word's
pronunciation, without any retraining. We simply add the new word $w$ with its
observed pronunciation $x^t$ into the pronouncing lexicon $L$, and assign the
new entry a probability $p(w, x^t \mid L)$ based on its observed frequency of
occurrence. The old entries in the lexicon have their probabilities scaled down
by $1 - p(w, x^t \mid L)$, and the transducer $\phi$ remains constant.
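
The lexicon update just described is a one-line rescaling. The following Python sketch illustrates it under the assumption that the lexicon is stored as a dictionary from (word, form) pairs to joint probabilities; the function name `add_entry` and the toy entries are hypothetical.

```python
from typing import Dict, Tuple

Entry = Tuple[str, str]  # (word, underlying form); a layout assumed for this sketch


def add_entry(
    lexicon: Dict[Entry, float], word: str, form: str, p_new: float
) -> Dict[Entry, float]:
    """Add (word, form) with probability p_new and rescale the old entries."""
    if not 0.0 < p_new < 1.0:
        raise ValueError("p_new must lie strictly between 0 and 1")
    if (word, form) in lexicon:
        raise ValueError("entry already present")
    # Old entries are scaled by (1 - p_new) so the joint still sums to one.
    updated = {entry: p * (1.0 - p_new) for entry, p in lexicon.items()}
    updated[(word, form)] = p_new
    return updated
```

The transducer is untouched by this operation, which is why no retraining is needed.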

4.4 Credit Assignment 

Recall that our joint probability model $p(w, x^t, y^V \mid \phi, L)$ is constructed
from three separate models: the conditional probability $p(w \mid x^t, L)$ is given
by the word model $p(w \mid L)$ and the lexical entry model $p(x^t \mid w, L)$, while
the joint probability $p(x^t, y^V \mid \phi)$ is given by the transducer $\phi$. Our
training paradigm simultaneously optimizes the parameters of all three models
on the training corpus. In order to better understand the contribution of each
model to the overall success of our joint model, we repeated our experiments
while alternately holding the word and lexical entry models fixed. In all
experiments the word model $p(w \mid L)$ and the lexical entry model
$p(x^t \mid w, L)$ are initialized uniformly. Our results are presented in the
following four tables.

Fix $p(w \mid L)$, Fix $p(x^t \mid w, L)$.

      Leven-    Stochastic Distance       Viterbi Distance
      shtein    Tied    Untied  Mixed     Tied    Untied  Mixed
E1    48.04     45.16   42.44   42.54     45.20   42.42   42.53
E2    33.00     31.14   28.99   29.16     31.22   29.01   29.16
E3    61.87     68.98   60.12   64.78     68.98   60.13   64.77
E4    56.35     64.35   54.66   57.61     64.35   54.66   57.61



Adapt $p(w \mid L)$, Fix $p(x^t \mid w, L)$.

      Leven-    Stochastic Distance       Viterbi Distance
      shtein    Tied    Untied  Mixed     Tied    Untied  Mixed
E1    48.04     20.91   18.61   18.74     20.88   18.58   18.73
E2    33.00     19.56   17.14   17.35     19.63   17.17   17.36
E3    61.87     40.55   35.13   38.39     40.54   35.14   38.39
E4    56.35     35.18   27.57   27.64     35.18   27.57   27.64



Fix $p(w \mid L)$, Adapt $p(x^t \mid w, L)$.

      Leven-    Stochastic Distance       Viterbi Distance
      shtein    Tied    Untied  Mixed     Tied    Untied  Mixed
E1    48.04     48.60   46.85   45.69     48.66   47.07   45.84
E2    33.00     30.99   26.67   28.51     31.06   26.68   26.80
E3    61.87     42.45   36.13   40.34     42.45   36.14   40.34
E4    56.35     36.86   27.51   34.71     36.86   27.51   34.71



Adapt $p(w \mid L)$, Adapt $p(x^t \mid w, L)$.

      Leven-    Stochastic Distance       Viterbi Distance
      shtein    Tied    Untied  Mixed     Tied    Untied  Mixed
E1    48.04     20.87   18.61   18.74     20.87   18.58   18.73
E2    33.00     19.56   17.14   17.35     19.63   17.16   17.35
E3    61.87     14.60   14.29   14.28     14.58   14.29   14.28
E4    56.35      9.34    9.36    9.36      9.34    9.36    9.36



For experiment E1, a uniform word model severely reduces recognition
performance. We believe this is because 57,269 of the 66,284 words in the E1
lexicon (86.4%) do not appear in either the training or testing corpora.
Adapting the word model reduces the effective size of the lexicon to the 8,570
words that appear in the training corpus, which significantly improves
performance.

For experiments E1 and E2, adapting the lexical entry model has almost no
effect, simply because the average number of entries per word is 1.07 in the E1
and E2 lexicons.

For experiments E3 and E4, adapting the word model alone is only slightly
more effective than adapting the lexical entry model alone. Adapting either
model alone reduces the error rate by nearly one half when compared to keeping
both models fixed. In contrast, adapting both models together reduces the error
rate to roughly one fifth to one sixth of its value when both models are kept
fixed. Thus, there is a surprising synergy to adapting both models together:
the improvement is substantially larger than one might expect from the
improvement obtained from adapting the models separately.

Current speech recognition technology typically employs a sparse pronounc- 
ing lexicon of hand-crafted underlying forms and imposes a uniform distribu- 
tion on the underlying pronunciations given the words. When the vocabulary 
is large or contains many proper nouns, then the pronouncing lexicon may be 
generated by a text-to-speech system. Our results suggest that a significant
performance improvement is possible by employing a richer pronouncing lexi- 
con, constructed directly from observed pronunciations, along with an adapted 
lexical entry model. 

This tentative conclusion is supported by Riley and Ljolje, who show an
improvement in speech recognizer performance by employing a richer
pronunciation model than is customary. Our approach differs from their approach
in three important ways. Firstly, our underlying pronouncing lexicon is
constructed directly from the observed pronunciations, without any human
intervention, while their underlying lexicon is obtained from a hand-built
text-to-speech system. Secondly, our probability model $p(y^V \mid w)$ assigns
nonzero probability to infinitely many surface forms, while their "network"
probability model assigns nonzero probability to only finitely many surface
forms. Thirdly, our use of the underlying form as a hidden variable means that
our model can represent arbitrary (nonlocal) dependencies in the surface forms,
which their probability model cannot.



5 Conclusion 

We explain how to automatically learn a string distance directly from a corpus 
containing pairs of similar strings. We also explain how to automatically learn a 
string classifier from a corpus of labeled strings. We demonstrate the efficacy of 
our techniques by correctly recognizing over 85% of the unseen pronunciations 
of syntactic words in conversational speech. The success of our approach argues 
strongly for the use of stochastic models in pattern recognition systems. 
A An Ad-Hoc Solution



In this appendix we report experimental results for a simple but ad-hoc solution
to the pronunciation recognition problem based on the classic "nearest neighbor"
decision rule. Here we learn a string distance using all valid pairs
$(x^T, y^{V_i})$ of underlying forms $x^T \in L(w_i)$ and surface realizations
$y^{V_i}$ for each word $w_i$ in the training corpus. For each phonetic string
$y^{S_j}$ in the testing corpus $C'$, we return the word $v_j$ in $D$ that
minimizes the string distance $d(x^t, y^{S_j})$ among all lexical entries
$(v, x^t) \in L$.
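
As an illustration of the decision rule, the following Python sketch applies nearest-neighbor classification with the untrained Levenshtein distance of footnote 5 in place of a learned distance. The function names, the representation of forms as tuples of phoneme symbols, and the toy lexicon are assumptions made for this example.

```python
from typing import List, Sequence, Set, Tuple

Form = Tuple[str, ...]      # a phonetic string as a tuple of symbols (assumed)
Entry = Tuple[str, Form]    # (word, underlying form)


def levenshtein(x: Sequence[str], y: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute or match
        prev = curr
    return prev[-1]


def nearest_entries(y: Form, lexicon: List[Entry]) -> Set[Entry]:
    """Return every lexical entry whose form is at minimal distance from y."""
    distances = [(levenshtein(x, y), (v, x)) for v, x in lexicon]
    best = min(d for d, _ in distances)
    return {entry for d, entry in distances if d == best}
```

Because homophones tie at the minimal distance, the rule returns a set of entries rather than a single word, which is what the scoring rule of section 4.2 expects.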

Our results are presented in the following table. The most striking property 
of these results is how poorly the trained transduction distances perform relative 
to the simple Levenshtein distance, particularly when the pronouncing lexicon 
is derived from the corpus (experiments E3 and E4). 





      Leven-    Stochastic Distance       Viterbi Distance
      shtein    Tied    Untied  Mixed     Tied    Untied  Mixed
E1    48.04     48.40   46.81   46.96     48.39   46.79   46.94
E2    33.00     33.55   32.58   31.82     33.69   31.59   31.81
E3    61.87     63.05   62.28   62.49     63.13   62.04   62.47
E4    56.35     56.35   59.01   57.63     56.35   59.02   57.69



Table 1: Word error rate for seven string distance functions in four experiments. 
This table shows the word error rate after the tenth EM iteration. None of the 
transduction distances is significantly better than the untrained Levenshtein 
distance in this approach. 

We believe that the poor performance of our transduction distances in these
experiments is due to the crudeness of the ad-hoc training paradigm. The
handcrafted lexicon used in experiments E1 and E2 contains only 1.07 entries
per syntactic word. In contrast, the lexicons derived from the corpus contain
more than 2.5 entries per syntactic word. These entries can be quite dissimilar,
and so our ad-hoc training paradigm trains our transduction distances on both
similar and dissimilar strings. The results presented in section 4.3 confirm
this hypothesis. And the poor results obtained here with an ad-hoc approach
justify the more sophisticated approach to string classification pursued in the
body of the report.
B Conditioning on String Lengths



In the main body of this report, we presented a probability function on string
pairs qua equivalence classes of terminated edit sequences. In order to create a
valid probability function on edit sequences, we allowed our transducer to
generate a distinguished termination symbol #. A central limitation of that
model is that the probability $p(n \mid \phi)$ of an edit sequence length $n$ must
decrease exponentially in $n$. Unfortunately, this model is poorly suited to
linguistic domains. The empirical distribution of pronunciation lengths in the
Switchboard corpus fails to fit the exponential model.

In this appendix, we present a parameterization of the memoryless transducer
$\theta$ without a termination symbol. This parameterization allows us to more
naturally define a probability function $p(\cdot, \cdot \mid \theta, T, V)$ over all
strings of lengths $T$ and $V$. Thus, unlike the probability function defined in
the main body of this report, summing $p(\cdot, \cdot \mid \theta, T, V)$ over all
pairs of strings in $A^T \times B^V$ will result in unity. This conditional
probability may be extended to a joint probability $p(x^T, y^V \mid \theta)$ on
string pairs by means of an arbitrary joint probability $p(T, V)$ on string
lengths.

$$p(x^T, y^V \mid \theta) = p(x^T, y^V \mid \theta, T, V)\, p(T, V)$$

As we shall see, the approach pursued in the body of the report has the ad- 
vantage of a simpler parameterization and simpler algorithms. A second differ- 
ence between the two approaches is that in the former approach, the transducer 
learns the relative lengths of the string pairs in the training corpus while in 
the current approach it cannot. In the current approach, all knowledge about 
string lengths is represented by the probability function p(T, V) and not by the 
transducer $\theta$.

We briefly considered an alternate parameterization of the transducer,

$$p(z^n \# \mid \theta) = p(z^n \mid \theta)\, p(n)$$

with an explicit distribution $p(n)$ on edit sequence lengths, that need not
assign uniformly decreasing probabilities to $n$. The principal disadvantage of
such an approach is that it significantly increases the computational
complexity of computing $p(x^T, y^V \mid \phi)$. We can no longer collapse all
partial edit sequences that generate the same prefix $(x^t, y^v)$ of the string
pair $(x^T, y^V)$ because these edit sequences may be of different lengths. As a
result the dynamic programming table for such a model must contain
$O(T \cdot V \cdot (T + V))$ entries. In contrast, the approach that we pursue in
this appendix only admits $O(T \cdot V)$ distinct states.

We begin by presenting an alternate parameterization of the memoryless
transducer, in which the transition probability $\delta(\cdot)$ is represented as
the product of the probability of choosing the type of edit operation
(insertion, deletion, or substitution) and the conditional probability of
choosing the symbol(s) used in the edit operation. This alternate
parameterization has the virtue of providing a probability function on any set
of string pairs of a given length. Finally, we present algorithms that generate,
evaluate, and learn the parameters for finite strings, conditioned on their
lengths.

B.1 Parameterization

A factored memoryless transducer $\theta = (A, B, \omega, \delta)$ consists of two
finite alphabets $A$ and $B$ as well as the triple
$\omega = (\omega_d, \omega_i, \omega_s)$ of transition probabilities and the triple
$\delta = (\delta_d, \delta_i, \delta_s)$ of observation probabilities. $\omega_s$ is
the probability of generating a substitution operation and $\delta_s(a, b)$ is
the probability of choosing the particular symbols $a$ and $b$ to substitute.
Similarly, $\omega_d$ is the probability of generating a deletion operation and
$\delta_d(a)$ is the probability of choosing the symbol $a$ to delete, while
$\omega_i$ is the probability of generating an insertion operation and
$\delta_i(b)$ is the probability of choosing the symbol $b$ to insert.

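The translation from our factored parameterization $\theta = (A, B, \omega, \delta)$
back to our unfactored parameterization $\phi = (A, B, \delta)$ is straightforward.

\begin{align*}
\delta(a, \epsilon) &= \omega_d\, \delta_d(a) \\
\delta(\epsilon, b) &= \omega_i\, \delta_i(b) \\
\delta(a, b) &= \omega_s\, \delta_s(a, b)
\end{align*}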

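The translation from the unfactored parameterization to the factored
parameterization is also straightforward.

\begin{align*}
\omega_d &= \sum_{a \in A} \delta(a, \epsilon) & \delta_d(a) &= \delta(a, \epsilon)/\omega_d \\
\omega_i &= \sum_{b \in B} \delta(\epsilon, b) & \delta_i(b) &= \delta(\epsilon, b)/\omega_i \\
\omega_s &= \sum_{a \in A,\, b \in B} \delta(a, b) & \delta_s(a, b) &= \delta(a, b)/\omega_s
\end{align*}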

As explained below, the factored parameterization is necessary in order to prop- 
erly accumulate expectations when the expectation maximization algorithm is 
conditioned on the string lengths. 
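
The two translations can be sketched in Python as follows. This is an illustration under assumptions, not the report's implementation: edit operations are represented as pairs of strings with the empty string standing in for $\epsilon$, the unfactored edit probabilities are assumed to carry no termination mass, each operation type is assumed to have nonzero total probability, and the function names are hypothetical.

```python
from typing import Dict, Tuple

EPS = ""                 # stand-in for the empty symbol (a convention of this sketch)
Edit = Tuple[str, str]   # (a, b), where a or b may be EPS


def factor(delta: Dict[Edit, float]):
    """Unfactored edit probabilities -> (omega, delta_d, delta_i, delta_s)."""
    dels = {a: p for (a, b), p in delta.items() if b == EPS}
    ins = {b: p for (a, b), p in delta.items() if a == EPS}
    subs = {ab: p for ab, p in delta.items() if EPS not in ab}
    # omega_* is the total mass of each operation type.
    omega = {"d": sum(dels.values()), "i": sum(ins.values()), "s": sum(subs.values())}
    delta_d = {a: p / omega["d"] for a, p in dels.items()}
    delta_i = {b: p / omega["i"] for b, p in ins.items()}
    delta_s = {ab: p / omega["s"] for ab, p in subs.items()}
    return omega, delta_d, delta_i, delta_s


def unfactor(omega, delta_d, delta_i, delta_s) -> Dict[Edit, float]:
    """Factored parameters -> unfactored edit probabilities."""
    delta = {(a, EPS): omega["d"] * p for a, p in delta_d.items()}
    delta.update({(EPS, b): omega["i"] * p for b, p in delta_i.items()})
    delta.update({ab: omega["s"] * p for ab, p in delta_s.items()})
    return delta
```

Applying `factor` and then `unfactor` should return the original table up to floating-point rounding.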

B.2 Generation 

A factored memoryless transducer $\theta = (A, B, \omega, \delta)$ induces a
probability function $p(\cdot, \cdot \mid \theta, T, V)$ on the joint space
$A^T \times B^V$ of all pairs of strings of length $T$ and $V$. This probability
function is defined by the following algorithm, which generates a string pair
$(x^T, y^V)$ from the joint space $A^T \times B^V$ according to
$p(\cdot \mid \theta, T, V)$.

GENERATE-STRINGS($T$, $V$, $\theta$)

1. initialize $t := 1$; $v := 1$;

2. while $t \le T$ and $v \le V$

3.     pick $(a, b)$ from $E$ according to $\delta(\cdot)$

4.     if ($a \in A$) then $x_t := a$; $t := t + 1$;

5.     if ($b \in B$) then $y_v := b$; $v := v + 1$;

6. while $t \le T$

7.     pick $a$ from $A$ according to $\delta_d(\cdot)$

8.     $x_t := a$; $t := t + 1$;

9. while $v \le V$

10.     pick $b$ from $B$ according to $\delta_i(\cdot)$

11.     $y_v := b$; $v := v + 1$;

12. return($(x^T, y^V)$);

The GENERATE-STRINGS() algorithm begins by drawing edit operations from $E$
according to the edit probability $\delta(\cdot)$ until at least one of the
partial strings $x^t$ and $y^v$ is complete [lines 2-5]. If $y^v$ is complete but
$x^t$ is incomplete, then we complete $x^t$ using symbols drawn from $A$
according to the marginal probability $\delta_d(\cdot) = \delta(\cdot \mid E_d)$
[lines 6-8]. Conversely, if $x^t$ is complete but $y^v$ is incomplete, then we
complete $y^v$ using symbols drawn from $B$ according to the marginal
$\delta_i(\cdot) = \delta(\cdot \mid E_i)$ [lines 9-11].
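
A Python sketch of this generation procedure is given below. It assumes the factored parameters are stored as dictionaries (`omega` keyed by "s", "d", "i"; the `delta_*` tables keyed by symbols or symbol pairs) and draws an edit operation by first sampling its type and then its symbols, which is equivalent to sampling from $\delta(\cdot)$ under the factored parameterization. The function name and data layout are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def generate_strings(
    T: int,
    V: int,
    omega: Dict[str, float],
    delta_d: Dict[str, float],
    delta_i: Dict[str, float],
    delta_s: Dict[Tuple[str, str], float],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Sample a string pair with exactly T and V symbols."""

    def pick(dist):
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]

    x: List[str] = []
    y: List[str] = []
    # Lines 2-5: free choice among the three operation types.
    while len(x) < T and len(y) < V:
        kind = pick(omega)
        if kind == "s":
            a, b = pick(delta_s)
            x.append(a)
            y.append(b)
        elif kind == "d":
            x.append(pick(delta_d))
        else:
            y.append(pick(delta_i))
    # Lines 6-8: y is complete, so only deletions remain.
    while len(x) < T:
        x.append(pick(delta_d))
    # Lines 9-11: x is complete, so only insertions remain.
    while len(y) < V:
        y.append(pick(delta_i))
    return x, y
```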



B.3 Evaluation 

The marginal probability $p(x^T, y^V \mid \theta, T, V)$ of a pair of strings is
calculated by summing the joint probability $p(x^T, y^V, z^n \mid \theta, T, V)$
over all the edit sequences that could have generated those strings

\begin{align*}
p(x^T, y^V \mid \theta, T, V)
  &= \sum_{z^n \in E^*} p(x^T, y^V, z^n \mid \theta, T, V) \\
  &= \sum_{z^n \in E^*} p(x^T, y^V \mid \theta, T, V, z^n)\, p(z^n \mid \theta, T, V) \\
  &= \sum_{\{z^n : \nu(z^n) = (x^T, y^V)\}} p(z^n \mid \theta, T, V)
\end{align*}

because $p(x^T, y^V \mid \theta, T, V, z^n)$ is nonzero if and only if
$\nu(z^n) = (x^T, y^V)$. By the definition of conditional probability,

$$p(z^n \mid \theta, T, V) = \prod_i p(z_i \mid \theta, T, V, z^{i-1}).$$

By the definition of the memoryless GENERATE-STRINGS() function, the
conditional probability $p(z_i \mid \theta, T, V, z^{i-1})$ of the edit operation
$z_i$ depends only on the relationship between the string lengths $T$, $V$ and
the state $(t, v)$ of the incomplete edit sequence $z^{i-1}$.



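$$
p(z_i \mid \theta, T, V, (t, v)) =
\begin{cases}
\omega_s\, \delta_s(a, b) & \text{if } t < T \wedge v < V \wedge z_i = (a, b) \\
\omega_d\, \delta_d(a) & \text{if } t < T \wedge v < V \wedge z_i = (a, \epsilon) \\
\omega_i\, \delta_i(b) & \text{if } t < T \wedge v < V \wedge z_i = (\epsilon, b) \\
\delta_d(a) & \text{if } t < T \wedge v = V \wedge z_i = (a, \epsilon) \\
\delta_i(b) & \text{if } t = T \wedge v < V \wedge z_i = (\epsilon, b) \\
0 & \text{otherwise}
\end{cases}
\tag{6}
$$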
Note that the corresponding transduction distance functions

\begin{align*}
d_\theta(x^T, y^V \mid T, V) &= -\log \max_{\{z^n : \nu(z^n) = (x^T, y^V)\}} p(z^n \mid \theta, T, V) \\
d'_\theta(x^T, y^V \mid T, V) &= -\log p(x^T, y^V \mid \theta, T, V)
\end{align*}

are now conditioned on the string lengths, and therefore are finite-valued only
for strings in $A^T \times B^V$.

The following algorithms calculate the probability $p(x^T, y^V \mid \theta, T, V)$
in quadratic time and space $O(T \cdot V)$. The space requirements of the
algorithm may be straightforwardly reduced to $O(\min(T, V))$. The only
difference between these versions and their unconditional variants in the body
of the report is that conditioning on the string lengths requires us to use the
conditional probabilities $\delta_d(\cdot)$ and $\delta_i(\cdot)$ instead of the
edit probabilities $\delta(\cdot)$ when a given hidden edit sequence has
completely generated one of the strings.

The following algorithm calculates the forward values. The forward variable
$\alpha_{t,v}$ contains the probability $p(x^t, y^v, (t, v) \mid \theta, T, V)$ of
passing through the state $(t, v)$ and generating the string prefixes $x^t$ and
$y^v$.

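FORWARD-EVALUATE-STRINGS($x^T$, $y^V$, $\theta$)

1. $\alpha_{0,0} := 1$;

2. for $t = 1 \ldots T$ [ $\alpha_{t,0} := \omega_d \delta_d(x_t)\,\alpha_{t-1,0}$; ]

3. for $v = 1 \ldots V$ [ $\alpha_{0,v} := \omega_i \delta_i(y_v)\,\alpha_{0,v-1}$; ]

4. for $t = 1 \ldots T-1$

5.   for $v = 1 \ldots V-1$

6.     $\alpha_{t,v} := \omega_s \delta_s(x_t, y_v)\,\alpha_{t-1,v-1} + \omega_d \delta_d(x_t)\,\alpha_{t-1,v} + \omega_i \delta_i(y_v)\,\alpha_{t,v-1}$;

7. for $t = 1 \ldots T-1$

8.   $\alpha_{t,V} := \omega_s \delta_s(x_t, y_V)\,\alpha_{t-1,V-1} + \delta_d(x_t)\,\alpha_{t-1,V} + \omega_i \delta_i(y_V)\,\alpha_{t,V-1}$;

9. for $v = 1 \ldots V-1$

10.   $\alpha_{T,v} := \omega_s \delta_s(x_T, y_v)\,\alpha_{T-1,v-1} + \omega_d \delta_d(x_T)\,\alpha_{T-1,v} + \delta_i(y_v)\,\alpha_{T,v-1}$;

11. $\alpha_{T,V} := \omega_s \delta_s(x_T, y_V)\,\alpha_{T-1,V-1} + \delta_d(x_T)\,\alpha_{T-1,V} + \delta_i(y_V)\,\alpha_{T,V-1}$;

12. return($\alpha$);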
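
The forward recurrence can be sketched in Python as follows. This is an illustrative rendering rather than the authors' code: the name `forward_evaluate` and the dictionary-based parameters are assumptions, and the separate boundary loops of the listing are folded into one loop that drops the $\omega$ factor whenever (6) marks the edit as forced. For an empty string, the sketch follows (6) directly.

```python
def forward_evaluate(x, y, omega, d_s, d_d, d_i):
    """Fill the (T+1) x (V+1) table of forward values alpha[t][v]."""
    T, V = len(x), len(y)
    alpha = [[0.0] * (V + 1) for _ in range(T + 1)]
    alpha[0][0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            if t == 0 and v == 0:
                continue
            total = 0.0
            if t > 0 and v > 0:
                # Substitution out of (t-1, v-1); that state is never forced.
                p = omega["s"] * d_s.get((x[t - 1], y[v - 1]), 0.0)
                total += p * alpha[t - 1][v - 1]
            if t > 0:
                # Deletion out of (t-1, v); forced (no omega_d) when v == V.
                w = 1.0 if v == V else omega["d"]
                total += w * d_d.get(x[t - 1], 0.0) * alpha[t - 1][v]
            if v > 0:
                # Insertion out of (t, v-1); forced (no omega_i) when t == T.
                w = 1.0 if t == T else omega["i"]
                total += w * d_i.get(y[v - 1], 0.0) * alpha[t][v - 1]
            alpha[t][v] = total
    return alpha
```

The entry `alpha[T][V]` is the probability of the string pair given its lengths. Summed over all pairs in $A^T \times B^V$, these probabilities should equal one when each $\delta$ table is normalized.
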

The following algorithm calculates the backward values. The backward variable
$\beta_{t,v}$ contains the probability $p(x_{t+1}^T, y_{v+1}^V \mid \theta, T, V, (t,v))$ of generating the
string suffixes $x_{t+1}^T$ and $y_{v+1}^V$ from the state $(t,v)$.

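BACKWARD-EVALUATE-STRINGS($x^T$, $y^V$, $\theta$)

1. $\beta_{T,V} := 1$;

2. for $t = T-1 \ldots 0$ [ $\beta_{t,V} := \delta_d(x_{t+1})\,\beta_{t+1,V}$; ]

3. for $v = V-1 \ldots 0$ [ $\beta_{T,v} := \delta_i(y_{v+1})\,\beta_{T,v+1}$; ]

4. for $t = T-1 \ldots 0$

5.   for $v = V-1 \ldots 0$

6.     $\beta_{t,v} := \omega_s \delta_s(x_{t+1}, y_{v+1})\,\beta_{t+1,v+1} + \omega_d \delta_d(x_{t+1})\,\beta_{t+1,v} + \omega_i \delta_i(y_{v+1})\,\beta_{t,v+1}$;

7. return($\beta$);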
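
A corresponding Python sketch of the backward pass is given below, under the same assumptions as before: the name `backward_evaluate` and the dictionary-based parameters are hypothetical. Strings are indexed from zero, so `x[t]` plays the role of $x_{t+1}$.

```python
def backward_evaluate(x, y, omega, d_s, d_d, d_i):
    """Fill the (T+1) x (V+1) table of backward values beta[t][v]."""
    T, V = len(x), len(y)
    beta = [[0.0] * (V + 1) for _ in range(T + 1)]
    beta[T][V] = 1.0
    for t in range(T, -1, -1):
        for v in range(V, -1, -1):
            if t == T and v == V:
                continue
            if t < T and v < V:
                # Unforced state: all three edits are available.
                total = omega["s"] * d_s.get((x[t], y[v]), 0.0) * beta[t + 1][v + 1]
                total += omega["d"] * d_d.get(x[t], 0.0) * beta[t + 1][v]
                total += omega["i"] * d_i.get(y[v], 0.0) * beta[t][v + 1]
            elif t < T:
                # v == V: y is complete, so the remaining x symbols are deleted.
                total = d_d.get(x[t], 0.0) * beta[t + 1][V]
            else:
                # t == T: x is complete, so the remaining y symbols are inserted.
                total = d_i.get(y[v], 0.0) * beta[T][v + 1]
            beta[t][v] = total
    return beta
```

The entry `beta[0][0]` is the probability of the whole string pair, which gives a convenient consistency check against the final forward value.
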






Observe that $\alpha_{t,v}\,\beta_{t,v}$ is the probability $p(x^T, y^V, (t,v) \mid \theta, T, V)$ of generating the
string pair $(x^T, y^V)$ by an edit sequence that passes through the state $(t,v)$.

B.4 Estimation 

The principal difference between the two expectation step algorithms is that
EXPECTATION-STEP-STRINGS() must accumulate expectations for the $\omega$ and $\delta$
parameter sets separately, via the $\chi$ and $\gamma$ variables, respectively. Due to the
definition (6) of $p(z_i \mid \theta, T, V, (t,v))$ above, we may only accumulate expectations
for the $\omega$ transition parameters when no transitions are forced.

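EXPECTATION-STEP-STRINGS($x^T$, $y^V$, $\theta$, $\chi$, $\gamma$)

1. $\alpha$ := FORWARD-EVALUATE-STRINGS($x^T$, $y^V$, $\theta$);

2. $\beta$ := BACKWARD-EVALUATE-STRINGS($x^T$, $y^V$, $\theta$);

3. for $t = 1 \ldots T-1$

4.   for $v = 1 \ldots V-1$

5.     $m_s := \alpha_{t-1,v-1}\,\omega_s \delta_s(x_t, y_v)\,\beta_{t,v} / \alpha_{T,V}$;

6.     $\gamma_s(x_t, y_v)$ += $m_s$; $\chi_s$ += $m_s$;

7.     $m_d := \alpha_{t-1,v}\,\omega_d \delta_d(x_t)\,\beta_{t,v} / \alpha_{T,V}$;

8.     $\gamma_d(x_t)$ += $m_d$; $\chi_d$ += $m_d$;

9.     $m_i := \alpha_{t,v-1}\,\omega_i \delta_i(y_v)\,\beta_{t,v} / \alpha_{T,V}$;

10.     $\gamma_i(y_v)$ += $m_i$; $\chi_i$ += $m_i$;

11. for $t = 1 \ldots T-1$ [ $\gamma_d(x_t)$ += $\alpha_{t-1,V}\,\delta_d(x_t)\,\beta_{t,V} / \alpha_{T,V}$; ]

12. for $v = 1 \ldots V-1$ [ $\gamma_i(y_v)$ += $\alpha_{T,v-1}\,\delta_i(y_v)\,\beta_{T,v} / \alpha_{T,V}$; ]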

Recall that $\alpha_{T,V}$ and $\beta_{0,0}$ both contain $p(x^T, y^V \mid \theta, T, V)$. Line 5 calculates
the posterior probability that we were in state $(t-1, v-1)$ and emitted
a $(x_t, y_v)$ substitution operation. Line 6 accumulates expectations for the $\omega_s$
parameter in the $\chi_s$ variable, and for the $\delta_s(x_t, y_v)$ parameter in the $\gamma_s(x_t, y_v)$
variable. Lines 7-8 accumulate the posterior probability that we were in state
$(t-1, v)$ and emitted a $(x_t, \epsilon)$ deletion operation. Similarly, lines 9-10 accumulate
the posterior probability that we were in state $(t, v-1)$ and emitted a $(\epsilon, y_v)$
insertion operation. Lines 11 and 12 accumulate the corresponding posterior
probabilities for forced deletion and insertion transitions, respectively. Note
that no expectations are accumulated for $\omega_d$ or $\omega_i$ in lines 11 and 12 because
these events do not occur on forced transitions.
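
The accumulation rule can be illustrated with the following Python sketch. It is not a transcription of the numbered listing: it enumerates every transition of the edit lattice through a hypothetical helper `transitions`, which is an assumption about the intended coverage, and it stores $\chi$ and $\gamma$ in plain dictionaries. The names `expectation_step`, `chi`, and `gamma` are likewise illustrative.

```python
def transitions(x, y, t, v, omega, d_s, d_d, d_i):
    """Yield (kind, symbol, next_state, probability, forced) for edits leaving (t, v)."""
    T, V = len(x), len(y)
    if t < T and v < V:      # neither string is complete: all three edits compete
        yield "s", (x[t], y[v]), (t + 1, v + 1), omega["s"] * d_s.get((x[t], y[v]), 0.0), False
        yield "d", x[t], (t + 1, v), omega["d"] * d_d.get(x[t], 0.0), False
        yield "i", y[v], (t, v + 1), omega["i"] * d_i.get(y[v], 0.0), False
    elif t < T:              # y is complete: deletion is forced
        yield "d", x[t], (t + 1, v), d_d.get(x[t], 0.0), True
    elif v < V:              # x is complete: insertion is forced
        yield "i", y[v], (t, v + 1), d_i.get(y[v], 0.0), True


def expectation_step(x, y, omega, d_s, d_d, d_i, chi, gamma):
    """Add the expected edit counts for the pair (x, y) into chi and gamma."""
    T, V = len(x), len(y)
    params = (omega, d_s, d_d, d_i)
    states = [(t, v) for t in range(T + 1) for v in range(V + 1)]
    alpha = dict.fromkeys(states, 0.0)
    beta = dict.fromkeys(states, 0.0)
    alpha[(0, 0)] = 1.0
    beta[(T, V)] = 1.0
    for (t, v) in states:            # row-major order: predecessors are finished first
        for _, _, nxt, p, _ in transitions(x, y, t, v, *params):
            alpha[nxt] += alpha[(t, v)] * p
    for (t, v) in reversed(states):  # successors are finished first
        for _, _, nxt, p, _ in transitions(x, y, t, v, *params):
            beta[(t, v)] += p * beta[nxt]
    evidence = alpha[(T, V)]
    if evidence == 0.0:
        return 0.0  # the pair is impossible under the model; nothing to accumulate
    for (t, v) in states:
        for kind, sym, nxt, p, forced in transitions(x, y, t, v, *params):
            m = alpha[(t, v)] * p * beta[nxt] / evidence  # posterior of this edit
            gamma[kind][sym] = gamma[kind].get(sym, 0.0) + m
            if not forced:  # omega is only exercised on unforced transitions
                chi[kind] += m
    return evidence
```

Forced transitions contribute to `gamma` but not to `chi`, mirroring the remark about lines 11 and 12. The function returns the probability of the pair so that a caller can also track the training likelihood.
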

Given the expectations of our transition parameters and observation parameters,
the following MAXIMIZATION-STEP-STRINGS() algorithm updates our
model parameters.

MAXIMIZATION-STEP-STRINGS($\theta$, $\chi$, $\gamma$)

1. $N := \chi_d + \chi_i + \chi_s$;

2. $\omega_d := \chi_d / N$; $\omega_i := \chi_i / N$; $\omega_s := \chi_s / N$;

3. $N_d := 0$; forall $a$ in $A$ [ $N_d$ += $\gamma_d(a)$; ]

4. forall $a$ in $A$ [ $\delta_d(a) := \gamma_d(a) / N_d$; ]

5. $N_i := 0$; forall $b$ in $B$ [ $N_i$ += $\gamma_i(b)$; ]

6. forall $b$ in $B$ [ $\delta_i(b) := \gamma_i(b) / N_i$; ]

7. $N_s := 0$; forall $(a,b)$ in $A \times B$ [ $N_s$ += $\gamma_s(a,b)$; ]

8. forall $(a,b)$ in $A \times B$ [ $\delta_s(a,b) := \gamma_s(a,b) / N_s$; ]
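
The update amounts to four independent normalizations, which the following Python sketch makes explicit. The name `maximization_step`, the dictionary layout of `chi` and `gamma`, and the fallback behaviour when a normalizer is zero are assumptions of this example; the report's listing does not address the zero case.

```python
def maximization_step(chi, gamma):
    """Re-estimate (omega, d_s, d_d, d_i) from accumulated expectations."""
    n = chi["s"] + chi["d"] + chi["i"]
    # Assumption: fall back to a uniform omega if no unforced edit was observed.
    omega = {k: (chi[k] / n if n > 0 else 1.0 / 3.0) for k in ("s", "d", "i")}

    def normalize(counts):
        z = sum(counts.values())
        # Assumption: return an empty table when this edit type was never observed.
        return {sym: c / z for sym, c in counts.items()} if z > 0 else {}

    return omega, normalize(gamma["s"]), normalize(gamma["d"]), normalize(gamma["i"])
```

Symbols absent from a `gamma` table have zero expected count, so normalizing over the observed keys gives the same result as summing over the full alphabet.
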